Abstract

Consider a genetic locus carrying a strongly beneficial allele which has recently fixed in a 
large population. As strongly beneficial alleles fix quickly, sequence diversity at partially 
linked neutral loci is reduced. This phenomenon is known as a selective sweep. 

The fixation of the beneficial allele not only affects sequence diversity at single neutral 
loci but also the joint allele distribution of several partially linked neutral loci. This dis- 
tribution can be studied using the ancestral recombination graph for samples of partially 
linked neutral loci during the selective sweep. To approximate this graph, we extend 
recent work by [SD05, EPW06] using a marked Yule tree for the genealogy at a single 
neutral locus linked to a strongly beneficial one. 

We focus on joint genealogies at two partially linked neutral loci in the case of large
selection coefficients $\alpha$ and recombination rates $\rho = O(\alpha/\log\alpha)$ between loci. Our approach
leads to a full description of the genealogy with accuracy of $O((\log\alpha)^{-2})$ in probability.
As an application, we derive the expectation of Lewontin's D as a measure for non-random 
association of alleles. 

1 Introduction 

The model of selective sweeps, also known as genetic hitchhiking, introduced by Maynard- 
Smith and Haigh in [MSH74], is the starting point for a large body of both empirical and 
theoretical population genetic studies ([Nur05]). It predicts that sequence diversity is reduced 
close to a strongly selected locus on a recombining genome near the time of fixation of the 
beneficial allele. Theoretical studies aim at describing these patterns of genetic diversity in 
detail while empirical work uses this prediction to identify genes under selection. 

If a species or a population adapts to its environment, several genes might be under strong 
selection. Moreover, if the function of genes were known, we would have predictions as to 
which genes are responsible for the adaptive process. Unfortunately, functional information 
is scarce. Without functional knowledge and in the presence of recombination, the model 
of selective sweeps helps to identify candidate genes affected by recent selective pressures. 
Genome scans are carried out for a sample of individuals, which show patterns of sequence 
diversity at many marker loci across the whole genome ([NWK+05]). If a marker shows low
diversity, statistical tests help to decide if a gene under selection is located nearby ([KS02,
LS05]).

Most theoretical studies of selective sweeps have focused on a model with one selective and 
one partially linked neutral locus ([MSH74, SWL92, KHL89, Bar98, SD05, EPW06]). This 
simple model already describes the reduction in sequence diversity. However, genetic data
are frequently available for many partially linked loci. This raises the question of whether 
selective sweeps also generate distinct patterns of multi-locus allele frequencies. We will follow 
[SSL06] and study a three locus model with one selective and two partially linked neutral loci. 
Using this model, it is possible to study the non-random association of allelic types at the 
two neutral loci, which is usually called linkage disequilibrium. 

An influential idea in the analysis of selective sweeps was to study approximate genealogies 
describing relationships between the individuals in a sample from the population. Studying 
genealogies at the selected site started with [KDH88] and was carried further to linked neutral 
loci in [KHL89].

The genealogy at a single neutral locus can be constructed as a structured coalescent. 
Here, the beneficial and wild-type allele at the selected locus form two subpopulations. Their 
sizes are determined by the frequency path of the beneficial allele during the selective sweep. 
Assume a new gamete is built (forward in time) by recombination of a beneficial allele at 
the selected locus and a neutral variant linked to a wild-type. Following the neutral variant 
backward in time leads to a migration event from the beneficial to the wild-type background. 
Therefore, recombination acts as migration between the beneficial and the wild-type back- 
grounds. 

Genealogies of two or more loci can be constructed using the ancestral recombination graph 
([Hud83, GM97]). Therefore, we will construct ancestries of two partially linked neutral loci 
under a selective sweep by a structured ancestral recombination graph. As in the case of only 
one locus, the two subpopulations are distinguished by the beneficial and wild-type allele at 
the selected locus, respectively. This ancestral recombination graph will serve as the exact 
model for genealogies at partially linked loci under a selective sweep. However, an exact 
analysis is hard to obtain, because the graph must be conditioned on the random frequency 
path of the beneficial allele. 

An alternative approach uses a two-step procedure for genealogies at the selective and 
the neutral locus. First, the (approximate) genealogy at the selective locus is generated and 
second, the genealogy at the neutral locus is added, which might differ due to recombination. 
Two approximate genealogies at the selected site have been proposed. First, a star-like 
genealogy, which means that the most recent common ancestor of all pairs in the population 
is the individual which carried the beneficial allele first ([SD05, NWK+05]). Second, a Yule
process, i.e., a pure birth process, which allows for coalescences also during the selective
sweep ([SD05, EPW06]). It was shown in [SD05, Theorems 1.1, 1.2] that the Yule process
approximation is more accurate than the star-like approximation. Therefore, we will use this Yule
process approximation for the genealogy at the selected site to study the three locus model of 
[SSL06] for selective sweeps. We will show that the analysis carried out in [EPW06] in the two 
locus case can be extended to the three locus case (Theorem 1). Moreover, the approximation 
by a Yule process can be used to calculate characteristics of linkage disequilibrium explicitly 
(Theorem 2).

Figure 1: The two possible geometries of the selected (S) and the two neutral loci (L and R).
The scaled recombination rates between loci are given by $\rho_{SL}$, $\rho_{LR}$, $\rho_{LS}$ and $\rho_{SR}$.



2 The model 

Consider a beneficial allele which enters a population of (haploid) size $N$ at time $t = 0$ and
has a selective advantage of $s$ with respect to the wild-type allele. Set $\alpha = sN$, which is called
the scaled selection coefficient. As selection can only be detected if the beneficial allele fixes
in the population, we condition on fixation of the beneficial allele and let T be the (random) 
time of fixation. 

Assume reproduction in the population follows a Wright-Fisher model, or, more generally, 
a Cannings model with individual offspring variance 1. In the limit of infinite N and a time 
rescaling in units of N generations, the frequency path of the beneficial allele is the solution 
of the SDE 

$$dX = \alpha X(1 - X)\coth(\alpha X)\,dt + \sqrt{X(1 - X)}\,dW, \tag{2.1}$$

with a standard Brownian motion $W$ and $X_0 = 0$. This diffusion arises as an $h$-transform of
the process describing the unconditional frequency path with the fixation probability of the
beneficial allele as a harmonic function and has 0 as an entrance boundary. (See e.g. [Gri03],
p. 245 and [EPW06], (2.1).)
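
A frequency path of (2.1) can be explored numerically. The following is only a minimal sketch, assuming a plain Euler-Maruyama discretisation; the function names, the step size `dt`, the clamping of the path to [0, 1] and the stopping rule at frequency 1 are illustrative choices and not part of the model above. At $X = 0$ the drift is evaluated through its limit, since $\alpha X\coth(\alpha X) \to 1$ as $X \to 0$.

```python
import math
import random


def sweep_drift(x, alpha):
    """Drift alpha*x*(1-x)*coth(alpha*x) of (2.1); it tends to (1 - x) as x -> 0."""
    ax = alpha * x
    if ax < 1e-8:
        return 1.0 - x
    return ax * (1.0 - x) / math.tanh(ax)


def simulate_sweep(alpha, dt=1e-5, seed=1, max_steps=2_000_000):
    """Euler-Maruyama path of (2.1), started in 0 and stopped when it reaches 1.

    Returns the discretised path and the elapsed (rescaled) time.
    """
    rng = random.Random(seed)
    x = 0.0
    path = [x]
    for _ in range(max_steps):
        noise = math.sqrt(max(x * (1.0 - x), 0.0) * dt) * rng.gauss(0.0, 1.0)
        x = x + sweep_drift(x, alpha) * dt + noise
        x = min(max(x, 0.0), 1.0)  # keep the discretised path inside [0, 1]
        path.append(x)
        if x >= 1.0:
            break
    return path, dt * (len(path) - 1)


if __name__ == "__main__":
    path, t_fix = simulate_sweep(alpha=200.0)
    print(len(path), t_fix)
```

The returned time is a crude stand-in for the fixation time $T$ and carries the discretisation error of the scheme.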

Two neutral loci are partially linked to the selected locus. For simplicity, we refer to the 
two neutral loci as the left and right neutral locus, denoted by L and R. As illustrated in
Figure 1, the selected locus lies either (i) outside or (ii) in between the neutral loci. All other 
possible geometries are equivalent to either (i) or (ii) because of the symmetry in the model. 

Recombination can break up the association of these three loci. (We only consider recom- 
bination as simple crossing over. Gene conversion is not considered in our model.) As we take 
a limiting infinite population and rescale time by a factor of N, we have to consider scaled 
recombination rates. These are different for the two geometries. For geometry (i) we denote
the recombination rates between adjacent loci by $\rho_{SL}$, $\rho_{LR}$ and for geometry
(ii) by $\rho_{LS}$, $\rho_{SR}$, respectively.

The two linked neutral loci do not affect the frequency path of the beneficial allele. In 
contrast, neutral variants which are linked to the beneficial allele at the beginning of the 
selective sweep rise in frequency. Looking backward in time from the time T of fixation, we 
can trace back the history of a finite sample at all three loci. As the neutral loci are linked 
to the selected one, the genealogies at all three loci are correlated. 

For the construction of the ancestral recombination graph relating all loci, time is running
backward, so we set $\beta = T - t$. Conditioned on a frequency path $X = (X_t)_{0 \le t \le T}$, given
by (2.1), we will describe the ancestral recombination graph as a partition-valued process
$\xi^X = (\xi^X_\beta)_{0 \le \beta \le T}$.

Assume we take a sample from the population at time $T$. Every individual in the sample
carries one L- and one R-locus. Of all L- and R-loci present in the sample we want to trace
back a number $\ell$ of L- and $r$ of R-loci. These loci are represented by sets $\boldsymbol{\ell}$ for the L- and
$\boldsymbol{r}$ for the R-loci. So, $\ell := |\boldsymbol{\ell}|$, $r := |\boldsymbol{r}|$. To define the state space of the structured ancestral
recombination graph, denote by $\mathcal{P}_A$ the set of partitions of $A$ for a finite set $A$ and define

$$\mathcal{P}'_{\boldsymbol{\ell} \cup \boldsymbol{r}} := \{\xi = (\xi^B, \xi^b) : \xi^B \cup \xi^b \in \mathcal{P}_{\boldsymbol{\ell} \cup \boldsymbol{r}},\ \xi^B \cap \xi^b = \emptyset\}.$$

The coordinates $\xi^B$ and $\xi^b$ contain partition elements located in the beneficial and the
wild-type background, respectively. For $\xi \in \mathcal{P}'_{\boldsymbol{\ell} \cup \boldsymbol{r}}$, every $j \in \boldsymbol{\ell} \cup \boldsymbol{r}$ lies in exactly one
partition element of $\xi$.

The ancestral process is started at the time $\beta = 0$ of fixation of the beneficial allele. So,
the sample of L- and R-loci is linked to the beneficial allele. Therefore, we start the process
in $\xi_0 = (\pi, \emptyset)$ for some $\pi \in \mathcal{P}_{\boldsymbol{\ell} \cup \boldsymbol{r}}$. Assume the state at time $\beta$ is $\xi_\beta = (\xi^B, \xi^b) \in \mathcal{P}'_{\boldsymbol{\ell} \cup \boldsymbol{r}}$. For
$j \in \boldsymbol{\ell} \cup \boldsymbol{r}$ the partition element of $\xi_\beta$ which contains $j$ encodes the set of L- and
R-loci, taken from the population at time $T$, which have the same ancestor as $j$ at time $T - \beta$.
Usually we will study the genealogy of $n$ pairs of L- and R-loci. In this case set $\boldsymbol{\ell} := \{1, \dots, n\}$
and $\boldsymbol{r} := \{n + 1, \dots, 2n\}$ and start the process with $\pi = \{\{1, n + 1\}, \dots, \{n, 2n\}\}$.

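The dynamics of the process is given as follows: Coalescence events occur for lines in
the beneficial and the wild-type background with pair coalescence rate $1/X_{T-\beta}$ and
$1/(1 - X_{T-\beta})$ at time $\beta$, respectively. So, given $\xi_\beta = (\xi^B, \xi^b)$ with
$\xi^B = \{\xi^B_1, \dots, \xi^B_{|\xi^B|}\}$ and $\xi^b = \{\xi^b_1, \dots, \xi^b_{|\xi^b|}\}$, transitions occur for
$1 \le j \ne k \le |\xi^B|$ and $1 \le j' \ne k' \le |\xi^b|$ from $(\xi^B, \xi^b)$ to

$$\begin{aligned}
&\big((\xi^B \setminus \{\xi^B_j, \xi^B_k\}) \cup \{\xi^B_j \cup \xi^B_k\},\ \xi^b\big) && \text{with rate } \frac{1}{X_{T-\beta}}, && (1)\\
&\big(\xi^B,\ (\xi^b \setminus \{\xi^b_{j'}, \xi^b_{k'}\}) \cup \{\xi^b_{j'} \cup \xi^b_{k'}\}\big) && \text{with rate } \frac{1}{1 - X_{T-\beta}}, && (2)
\end{aligned} \tag{2.2}$$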

respectively. For transitions in the process $\xi^X$ due to recombination we focus on geometry
(i) first. A recombination event hits one line between the S and the L locus with rate $\rho_{SL}$
and between the L and the R locus with rate $\rho_{LR}$. If a recombination event occurs between
the S and the L locus, it may be that both recombining chromosomes carry the same allele
at the S locus. Such a recombination event has no visible effect and we
ignore it in the process $\xi^X$. All other recombination events must be modeled. If $\xi_\beta = (\xi^B, \xi^b)$
with $\xi^B = \{\xi^B_1, \dots, \xi^B_{|\xi^B|}\}$ and $\xi^b = \{\xi^b_1, \dots, \xi^b_{|\xi^b|}\}$, the recombination transitions
(3i)-(8i), collected in (2.3), occur for $1 \le j \le |\xi^B|$ and $1 \le k \le |\xi^b|$.

Here, (3i) encodes a recombination event which takes a pair of linked L- and R-loci from the
beneficial to the wild-type background; an event (4i) separates the R-locus of a line and takes
it to the wild-type background; by (5i) the L and R loci of a line in the beneficial background
are split but remain both in the same background; (6i) describes the same transition for a line
in the wild-type background. The transitions (7i) and (8i) describe the back-recombination
of loci into the beneficial background.

Example 2.1. An example displaying the dynamics of the process $\xi^X$ for geometry (i) is
shown in Figure 2. The sets of L- and R-loci are $\boldsymbol{\ell} = \{1, 2, 3\}$ and $\boldsymbol{r} = \{4, 5, 6\}$, respectively.
The starting partition is $\xi^X_0 = (\pi, \emptyset)$ with $\pi = \{\{1, 4\}, \{2, 5\}, \{3, 6\}\}$. Several kinds of events
can happen: coalescences in the beneficial background, i.e., an event (1); recombinations which
leave the two neutral loci together but change the allele at the selected site, i.e., an event
(3i); and recombination events which split the two neutral loci. The last kind of event may
either bring one of the two neutral loci into a different background, (4i), or split a line within
the beneficial background, (5i), or split a line in the wild-type background, (6i). The final
partition is $\xi^X_T = (\xi^B, \xi^b)$ with $\xi^B = \{\{1, 2\}\}$, $\xi^b = \{\{3\}, \{4\}, \{5\}, \{6\}\}$.

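For geometry (ii) we have (rescaled) recombination rates $\rho_{LS}$ and $\rho_{SR}$ between the left
neutral and the selective and the right and the selective locus, respectively. Here, transitions
occur from $(\xi^B, \xi^b)$ to

$$\begin{aligned}
&\big((\xi^B \setminus \{\xi^B_j\}) \cup \{\xi^B_j \cap \boldsymbol{r}\},\ \xi^b \cup \{\xi^B_j \cap \boldsymbol{\ell}\}\big) && \text{with rate } \rho_{LS}(1 - X_{T-\beta}) && (3ii)\\
&\big((\xi^B \setminus \{\xi^B_j\}) \cup \{\xi^B_j \cap \boldsymbol{\ell}\},\ \xi^b \cup \{\xi^B_j \cap \boldsymbol{r}\}\big) && \text{with rate } \rho_{SR}(1 - X_{T-\beta}) && (4ii)\\
&\big((\xi^B \setminus \{\xi^B_j\}) \cup \{\xi^B_j \cap \boldsymbol{\ell},\ \xi^B_j \cap \boldsymbol{r}\},\ \xi^b\big) && \text{with rate } (\rho_{LS} + \rho_{SR}) X_{T-\beta} && (5ii)\\
&\big(\xi^B,\ (\xi^b \setminus \{\xi^b_k\}) \cup \{\xi^b_k \cap \boldsymbol{\ell},\ \xi^b_k \cap \boldsymbol{r}\}\big) && \text{with rate } (\rho_{LS} + \rho_{SR})(1 - X_{T-\beta}) && (6ii)\\
&\big(\xi^B \cup \{\xi^b_k \cap \boldsymbol{\ell}\},\ (\xi^b \setminus \{\xi^b_k\}) \cup \{\xi^b_k \cap \boldsymbol{r}\}\big) && \text{with rate } \rho_{LS} X_{T-\beta} && (7ii)\\
&\big(\xi^B \cup \{\xi^b_k \cap \boldsymbol{r}\},\ (\xi^b \setminus \{\xi^b_k\}) \cup \{\xi^b_k \cap \boldsymbol{\ell}\}\big) && \text{with rate } \rho_{SR} X_{T-\beta}. && (8ii)
\end{aligned} \tag{2.4}$$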
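
To make the bookkeeping in (2.4) concrete, the transitions out of a given state can be enumerated mechanically. The sketch below is an illustration only: the function name, the representation of partition elements as frozensets and the convention of dropping empty intersections (and skipping transitions that leave the state unchanged) are assumptions of this example, not part of the definition.

```python
def _with(parts, *new):
    """Return `parts` extended by the non-empty sets among `new`."""
    return frozenset(parts) | {p for p in new if p}


def recombination_transitions_ii(xi_B, xi_b, ell, r, rho_LS, rho_SR, x):
    """Enumerate events (3ii)-(8ii) of (2.4) from the state (xi_B, xi_b).

    xi_B, xi_b : collections of frozensets (elements in the beneficial and
                 wild-type background); ell, r : sets of L- and R-loci;
    x          : current frequency X_{T-beta} of the beneficial allele.
    Returns a list of (label, new_xi_B, new_xi_b, rate).
    """
    ell, r = frozenset(ell), frozenset(r)
    xi_B, xi_b = frozenset(xi_B), frozenset(xi_b)
    out = []
    for elem in xi_B:
        left, right = elem & ell, elem & r
        rest = xi_B - {elem}
        out.append(("3ii", _with(rest, right), _with(xi_b, left), rho_LS * (1 - x)))
        out.append(("4ii", _with(rest, left), _with(xi_b, right), rho_SR * (1 - x)))
        out.append(("5ii", _with(rest, left, right), xi_b, (rho_LS + rho_SR) * x))
    for elem in xi_b:
        left, right = elem & ell, elem & r
        rest = xi_b - {elem}
        out.append(("6ii", xi_B, _with(rest, left, right), (rho_LS + rho_SR) * (1 - x)))
        out.append(("7ii", _with(xi_B, left), _with(rest, right), rho_LS * x))
        out.append(("8ii", _with(xi_B, right), _with(rest, left), rho_SR * x))
    # Drop formal transitions that do not change the state.
    return [t for t in out if (t[1], t[2]) != (xi_B, xi_b)]
```

For a single linked pair in the beneficial background, only (3ii), (4ii) and (5ii) are possible, in line with the description of these events.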



These events refer to a change in background from the beneficial to the wild-type background
either for the L-locus, (3ii), or the R-locus, (4ii). Splits in the beneficial and wild-type
background may happen as in the case of geometry (i); see events (5ii) and (6ii).
Back-recombinations to the beneficial background are denoted by (7ii) for the L- and (8ii) for the
R-locus. Observe that a transition which takes both loci on one line from the beneficial to
the wild-type background cannot occur for geometry (ii); cf. event (3i).

Definition 2.2. Assume $\boldsymbol{\ell}$ and $\boldsymbol{r}$ are sets of left and right neutral loci, respectively, and
$X = (X_t)_{0 \le t \le T}$ is a frequency path of the beneficial allele given by (2.1).

Conditioned on $X$, consider the jump process $\xi^X = (\xi^X_\beta)_{0 \le \beta \le T}$, which starts in $\xi^X_0 =
(\pi, \emptyset)$ for $\pi \in \mathcal{P}_{\boldsymbol{\ell} \cup \boldsymbol{r}}$ and makes transitions by coalescence events (1), (2), given by (2.2), and
recombination events (3i)-(8i) or (3ii)-(8ii) from (2.3) and (2.4), respectively. This process
$\xi^X$ is called the structured ancestral recombination graph for the L and R locus conditioned
on $X$ for geometry (i) or (ii), respectively.

The mixture of $\xi^X_T$ over the distribution of frequency paths given by (2.1) defines the
random partition $\Gamma_\pi = (\Gamma^B_\pi, \Gamma^b_\pi)$, i.e.,

$$\Gamma_\pi := \int \xi^X_T \, \mathbf{P}[dX].$$

Figure 2: A structured ancestral recombination graph $\xi^X$ conditioned on the frequency path
$X$ of the beneficial allele. Between times $\beta = 0$ and $\beta = T$ coalescences may occur at rates
(1) and (2). Recombination events happen at rates (3i)-(8i). The dashed lines indicate
ancestry of the L-locus while the R-locus may be traced along dotted lines.



3 Main result 

We study selective sweeps in the infinite population limit, i.e., the frequency of the beneficial
allele follows the SDE given by (2.1). Moreover, selection is most efficient for large selection
coefficients. Our goal is to derive a simpler but approximate expression for $\Gamma_\pi$ in the regime
of large $\alpha$. It was shown in [EPW06] that for the fixation time $T$ of the beneficial allele

$$E[T] = \frac{2\log\alpha}{\alpha} + O\Big(\frac{1}{\alpha}\Big), \qquad V[T] = O\Big(\frac{1}{\alpha^2}\Big) \tag{3.1}$$

for large $\alpha$. This suggests that only under the scaling $\rho = O(\alpha/\log\alpha)$ for the recombination
rate a non-trivial number of recombination events occurs during the sweep for large $\alpha$. This
is true for all possible kinds of recombination events during the sweep, so the recombination
rates $\rho_{SL}, \rho_{LR}$ and $\rho_{LS}, \rho_{SR}$ for geometries (i) and (ii) should be of this order. Henceforth,
we assume

$$\begin{aligned}
\text{Geometry (i):}\quad & \rho_{SL} = \gamma_{SL}\frac{\alpha}{\log\alpha}, \quad \rho_{LR} = \gamma_{LR}\frac{\alpha}{\log\alpha}, \quad 0 < \gamma_{SL}, \gamma_{LR} < \infty,\\
\text{Geometry (ii):}\quad & \rho_{LS} = \gamma_{LS}\frac{\alpha}{\log\alpha}, \quad \rho_{SR} = \gamma_{SR}\frac{\alpha}{\log\alpha}, \quad 0 < \gamma_{LS}, \gamma_{SR} < \infty.
\end{aligned}$$

Our approximation of $\Gamma_\pi$ is based on a Yule tree, which serves as an approximation of the
genealogy at the selected locus. A Yule tree is the realization of a Yule process, i.e., a pure 
birth process which starts with one line and every line splits in two lines after an exponential 
waiting time. 
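
As an illustration of this pure birth mechanism, the split times of a Yule process can be simulated directly. This is a minimal sketch: the function name and the parameters `rate` (per-line splitting rate) and `n_lines` (number of lines at which the process is stopped) are hypothetical names chosen for the example.

```python
import random


def simulate_yule_split_times(rate, n_lines, seed=1):
    """Times at which a Yule process started with one line reaches 2, 3, ..., n_lines lines.

    With `lines` lines present, each line splits at rate `rate`, so the
    waiting time until the next split is exponential with rate `rate * lines`.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    for lines in range(1, n_lines):
        t += rng.expovariate(rate * lines)
        times.append(t)
    return times


if __name__ == "__main__":
    print(simulate_yule_split_times(rate=50.0, n_lines=100)[-1])
```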

In our approximation the quantity

$$p_{i_1}^{i_2}(\gamma) := \exp\Big(-\frac{\gamma}{\log\alpha} \sum_{i=i_1+1}^{i_2} \frac{1}{i}\Big) \tag{3.2}$$

will play an important role.

Assume $\boldsymbol{\ell}$ and $\boldsymbol{r}$ are sets of left and right loci and $\pi \in \mathcal{P}_{\boldsymbol{\ell} \cup \boldsymbol{r}}$. Three mechanisms determine
the Yule approximation of the partition $\Gamma_\pi$. First, we approximate splits in the beneficial
background, i.e., events (5i) and (5ii), by the following procedure:

For all partition elements $\pi_1, \dots, \pi_{|\pi|}$ realize Bernoulli random variables
$U_1, \dots, U_{|\pi|}$ which are 1 with success probability

geometry (i): 1 - p^ 2ai (jlr)) geometry (ii): 1 - p^ ai (-f LS + jsr))- 

(3.3) 

If $U_i = 1$, split the $i$th partition element in its left and right locus. Altogether,
this defines a partition

$$\pi' = \{\pi_i \cap \boldsymbol{\ell},\ \pi_i \cap \boldsymbol{r} : U_i = 1\} \cup \{\pi_i : U_i = 0\}.$$
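
The splitting step can be sketched in a few lines. The code below is an illustration under assumptions: the success probability from (3.3) is passed in as the hypothetical parameter `split_prob` rather than computed, and only elements that contain both an L- and an R-locus are split, so that no empty partition elements arise.

```python
import random


def split_partition(pi, ell, r, split_prob, rng=None):
    """Split each element of `pi` into its L- and R-part with probability `split_prob`.

    pi : iterable of sets of loci; ell, r : sets of L- and R-loci.
    Returns the finer partition pi' as a list of frozensets.
    """
    rng = rng or random.Random()
    ell, r = frozenset(ell), frozenset(r)
    result = []
    for elem in pi:
        elem = frozenset(elem)
        left, right = elem & ell, elem & r
        if left and right and rng.random() < split_prob:
            result.extend([left, right])  # U_i = 1: separate L- and R-locus
        else:
            result.append(elem)           # U_i = 0: element stays intact
    return result
```

For instance, with `pi = [{1, 3}, {2, 4}]`, `ell = {1, 2}` and `r = {3, 4}`, a split probability of 1 separates every pair and a split probability of 0 returns the original elements.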

Next, realize a Yule process with branching rate $\alpha$, i.e., each line splits in two lines at rate
$\alpha$. Stop this process when it has $\lfloor 2\alpha \rfloor$ lines. Call this tree $\mathcal{Y}$. To obtain the genealogy of a
sample of size $|\pi'|$ from this tree with $\lfloor 2\alpha \rfloor$ extant leaves, we use the following construction:

Start with $|\pi'|$ lines from the full Yule tree $\mathcal{Y}$ with $\lfloor 2\alpha \rfloor$ lines. When there
are $k$ lines left at the time the full tree has $i$ lines, the probability that a
coalescence event occurs among the $k$ lines at the time the full tree goes from
$i$ to $i - 1$ lines is

$$\binom{k}{2} \Big/ \binom{i}{2}. \tag{3.4}$$

By this construction we build a tree $\mathcal{Y}_{|\pi'|}$ with the partition elements of $\pi'$ as
leaves and nodes which record the number of lines in the full Yule tree.
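
The construction above only requires the number of sample lines at each stage of the full tree, which can be generated backward from the leaves. The following sketch tracks these counts; the function name and the returned dictionary are illustrative choices, and the tree topology (which pair of lines coalesces) is not recorded here.

```python
import math
import random


def sample_line_counts(n_sample, alpha, seed=1):
    """Number of sample lines while the full Yule tree has i lines, for i = floor(2*alpha), ..., 1.

    Going from i to i - 1 lines in the full tree, a coalescence among the
    current k sample lines happens with probability C(k, 2) / C(i, 2), as in (3.4).
    """
    rng = random.Random(seed)
    n_full = math.floor(2 * alpha)
    if not 1 <= n_sample <= n_full:
        raise ValueError("need 1 <= n_sample <= floor(2 * alpha)")
    k = n_sample
    counts = {n_full: k}
    for i in range(n_full, 1, -1):
        if rng.random() < math.comb(k, 2) / math.comb(i, 2):
            k -= 1  # two of the k sample lines merge
        counts[i - 1] = k
    return counts
```

Since the probability equals 1 whenever $k = i$, the sample never has more lines than the full tree, and exactly one line remains when the full tree has one line.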

Remark 3.1. To construct the sample tree $\mathcal{Y}_{|\pi'|}$ from $\mathcal{Y}$ is a task equivalent to describing an
exchangeable sample from a tree which arises by exchangeable binary coalescence dynamics.
This has been studied by [STW84] and was recalled in [EPW06, Lemma 4.8]. If $I_t = i$ is
the number of lines in the Yule tree $\mathcal{Y}$ at time $t$, denote by $K_i$ the number of lines in $\mathcal{Y}_{|\pi'|}$
while $I_t = i$. The process $(K_i)_{\lfloor 2\alpha \rfloor \ge i \ge 1}$ is a time-inhomogeneous Markov chain with transition
probabilities

$$P[K_{i-1} = k - 1 \mid K_i = k] = \binom{k}{2} \Big/ \binom{i}{2}, \qquad i = 2, \dots, \lfloor 2\alpha \rfloor,\ k = 2, \dots, |\pi'|.$$

Moreover, the sample tree can be described forward in time by noting that 

rr'l-k + l 



F[K i = k\K i „ 1 = k-l] = L- 



□

mark    probability
SL      $(1 - p_{i_1}^{i_2}(\gamma_{SL}))\, p_{i_1}^{i_2}(\gamma_{LR})$
LR      $p_{i_1}^{i_2}(\gamma_{SL})\, (1 - p_{i_1}^{i_2}(\gamma_{LR}))$
SLR     $(1 - p_{i_1}^{i_2}(\gamma_{SL}))\, (1 - p_{i_1}^{i_2}(\gamma_{LR}))$
no      $p_{i_1}^{i_2}(\gamma_{SL})\, p_{i_1}^{i_2}(\gamma_{LR})$

Table 1: For geometry (i), we mark every branch in the Yule tree by at most one from three
different kinds of events. If a branch starts when the full Yule tree has $i_1$ and ends when it
has $i_2$ lines, the probabilities for all marks are given in the table.

The sample tree which is pruned out of the full tree in this way represents the genealogy
at the selected site. To describe the genealogies at the partially linked neutral sites we mark
the sample Yule tree to determine further recombination events. A mark stands for one (or
two) recombination events that may occur. This works in the following way:

Let a branch in the tree $\mathcal{Y}_{|\pi'|}$ be given which starts when the full tree has $i_1$
lines and ends when the full tree has $i_2$ lines. For geometry (i), every
branch can be hit by at most one of three different kinds of marks indicating
recombination events. These are SL-, LR-, and SLR-marks. Their probabilities
are given in Table 1. For geometry (ii) the branch is hit independently
by LS- and SR-marks with probabilities $1 - p_{i_1}^{i_2}(\gamma_{LS})$ and $1 - p_{i_1}^{i_2}(\gamma_{SR})$.
Here, SL-marks separate the S- from the L-locus on each branch of the tree,
etc. For geometry (i), SLR-marks separate the S- from the L- and the L-
from the R-locus.                                                        (3.5)
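
The mark probabilities of Table 1 can be tabulated for a given branch. The sketch below rests on an explicit assumption: it takes the quantity in (3.2) to be $p_{i_1}^{i_2}(\gamma) = \exp\big(-\frac{\gamma}{\log\alpha}\sum_{i=i_1+1}^{i_2} 1/i\big)$; the function names are hypothetical. The four outcomes (three kinds of marks, or no mark) then form a probability distribution for every branch.

```python
import math


def p_no_event(i1, i2, gamma, alpha):
    """Assumed form of (3.2): exp(-(gamma / log(alpha)) * sum_{i=i1+1}^{i2} 1/i)."""
    harmonic_part = sum(1.0 / i for i in range(i1 + 1, i2 + 1))
    return math.exp(-gamma / math.log(alpha) * harmonic_part)


def mark_probabilities_i(i1, i2, gamma_SL, gamma_LR, alpha):
    """Mark probabilities for geometry (i) on a branch present from i1 to i2 lines (Table 1)."""
    p_sl = p_no_event(i1, i2, gamma_SL, alpha)
    p_lr = p_no_event(i1, i2, gamma_LR, alpha)
    return {
        "SL": (1.0 - p_sl) * p_lr,
        "LR": p_sl * (1.0 - p_lr),
        "SLR": (1.0 - p_sl) * (1.0 - p_lr),
        "none": p_sl * p_lr,
    }


if __name__ == "__main__":
    print(mark_probabilities_i(3, 40, gamma_SL=0.5, gamma_LR=1.5, alpha=1000.0))
```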

Example 3.2. The above construction is illustrated in Figure 3. We consider geometry
(i) here. A set $\boldsymbol{\ell} = \{1, 2, 3, 4\}$ of L-loci and $\boldsymbol{r} = \{5, 6, 7, 8\}$ of R-loci is given. Starting
with $\pi = \{\{1, 5\}, \{2, 6\}, \{3, 7\}, \{4, 8\}\}$, every partition element is split with probability
according to (3.3). This results in the finer partition $\pi'$. The partition elements of $\pi'$ are
used to construct a sample tree from a full Yule tree which has $\lfloor 2\alpha \rfloor$ lines. The coalescence
probabilities for the sample are given by (3.4). On the sample tree, branches are marked by
SL-, LR-, or SLR-marks according to Table 1. The resulting partition $\pi''$ is constructed as
given in Definition 3.3.

We are now in a position to define our approximation based on the Yule process. 

Figure 3: The Yule process approximation for two linked neutral loci under a selective sweep.
Here, we consider geometry (i). The L-locus may be traced back along dashed lines while
dotted lines indicate ancestry of the R-locus. See text for explanation.

Definition 3.3. Assume $\boldsymbol{\ell}$ and $\boldsymbol{r}$ are sets of left and right neutral loci, respectively, and
$\pi \in \mathcal{P}_{\boldsymbol{\ell} \cup \boldsymbol{r}}$. By (3.3) construct the partition $\pi'$ and by (3.4) and (3.5) a Yule tree with
marks. For geometry (i) define the equivalence relation:



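$$j \sim k \ :\Longleftrightarrow\ \begin{cases}
\text{no SL-, SLR-mark on [tree diagram]}, & \text{if } j, k \in \boldsymbol{\ell},\\
\text{no SL-, LR-, SLR-mark on [tree diagram]}, & \text{if } j, k \in \boldsymbol{r},\\
\text{no SL-mark on [tree diagram], no LR-mark on [tree diagram], no SLR-mark on [tree diagram]}, & \text{if } j \in \boldsymbol{\ell},\ k \in \boldsymbol{r},
\end{cases} \tag{3.6}$$

where the bold lines in each tree diagram indicate for which part of the tree $\mathcal{Y}_{|\pi'|}$, relating two lines with the root of
the tree, the constraint on marks applies. For geometry (ii) set

$$j \sim k \ :\Longleftrightarrow\ \begin{cases}
\text{no LS-mark on [tree diagram]}, & \text{if } j, k \in \boldsymbol{\ell},\\
\text{no SR-mark on [tree diagram]}, & \text{if } j, k \in \boldsymbol{r},\\
\text{no LS-mark on [tree diagram], no SR-mark on [tree diagram]}, & \text{if } j \in \boldsymbol{\ell},\ k \in \boldsymbol{r}.
\end{cases} \tag{3.7}$$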

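(The equations (3.6) and (3.7) indeed define equivalence relations, as can easily be checked.)
Each of these equivalence relations on $\boldsymbol{\ell} \cup \boldsymbol{r}$ defines a partition $\pi''$. For geometry (i) there is
a unique partition element

$$\{j \in \boldsymbol{\ell} : \text{no SL-, SLR-mark on [tree diagram]}\} \cup \{k \in \boldsymbol{r} : \text{no SL-, LR-, SLR-mark on [tree diagram]}\} \tag{3.8}$$

and for geometry (ii) a unique partition element

$$\{j \in \boldsymbol{\ell} : \text{no LS-mark on [tree diagram]}\} \cup \{k \in \boldsymbol{r} : \text{no SR-mark on [tree diagram]}\}. \tag{3.9}$$

Then the random partition $\Upsilon_\pi$, whose first coordinate consists of this unique partition element
and whose second coordinate consists of all other elements of $\pi''$,
is called the Yule approximation of $\Gamma_\pi$.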

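Example 3.4. For the example in Figure 3 the SL-, LR- and SLR-marks on the sample tree
lead to the realization

$$\Upsilon_\pi = \big(\{\{3, 4\}\},\ \{\{1, 2\}, \{5, 6\}, \{7\}, \{8\}\}\big).$$

Theorem 1. Let $\pi \in \mathcal{P}_{\boldsymbol{\ell} \cup \boldsymbol{r}}$ and $\Gamma_\pi$ and $\Upsilon_\pi$ be as in Definitions 2.2 and 3.3. Then,

$$\sup_{\xi \in \mathcal{P}'_{\boldsymbol{\ell} \cup \boldsymbol{r}}} \big| P[\Gamma_\pi = \xi] - P[\Upsilon_\pi = \xi] \big| = O\Big(\frac{1}{(\log\alpha)^2}\Big).$$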

Remark 3.5. 1. The Theorem states that, for large $\alpha$, the random partitions $\Gamma_\pi$ and $\Upsilon_\pi$
are close in variation distance. Here, variation distance refers to the maximal difference
in the probabilities to obtain any partition $\xi \in \mathcal{P}'_{\boldsymbol{\ell} \cup \boldsymbol{r}}$. The order of accuracy, given by
the Landau symbol, still depends on several parameters. These are the cardinalities
$\ell$ and $r$ and recombination constants $\gamma_{SL}, \gamma_{LR}$ for geometry (i) and $\gamma_{LS}$ and $\gamma_{SR}$ for
geometry (ii). The proof of Theorem 1 will be given in Section 5.

2. At first sight, comparing the Definitions 3.3 and 2.2 the Yule approximation does not
look any simpler than the exact model. However, the Yule approximation has advantages
both analytically and computationally. The random partition $\Gamma_\pi$ relies on constructing
a frequency path $X$, while the Yule approximation $\Upsilon_\pi$ constructs the ancestral
recombination graph for the sample directly. Analytically, as we will see in Section 4,
this means that explicit calculations are possible. Computationally, i.e., for simulations
of the ancestral recombination graph, the direct construction of the ancestry of the
sample allows for fast algorithms; see [PHW06] for the case of a single neutral locus.

3. The current paper is a generalisation of results found in [EPW06] for a two-locus system
with only one neutral locus. More precisely, consider the projection of $\Upsilon_\pi$ on only one
locus, i.e., on either $\boldsymbol{\ell}$ or $\boldsymbol{r}$. In Propositions 4.2 and 4.7 of that paper it was shown that
the projection of $\Upsilon_\pi$ on $\boldsymbol{\ell}$ or $\boldsymbol{r}$ is an approximation to a structured coalescent with an
error in probability of the order $O((\log\alpha)^{-2})$.

4. In [EPW06] an approximate sampling formula was given in the two-locus case. A similar 
approach would be possible here. However, we refrain from its derivation because it 
was shown in [PHW06] that the sampling formula in the two-locus case only produces 
numerically sound results for n < 5. 

5. As indicated numerically in [PHW06], the Yule approximation can be improved. To
understand how this works, we need to collect the errors which contribute to the error
of order $O(1/(\log\alpha)^2)$. First, the Yule approximation ignores events (2), (6), (7) and
(8). Second, as will be clear in the proof of Proposition 5.5, the coalescence rate in the
beneficial background is decreased from $1/X\,dt$ to $(1 - X)/X\,dt$ by the Yule process.
It is the latter error that dominates, at least in large samples, because the total
coalescence rate increases quadratically with the number of lines. However, increasing the
coalescence probability in (3.4) to

G) 1 - «r 

at the time the Yule tree has $i$ lines corrects for this error.

6. For simulations of genealogies it is most important that the Yule approximation given
above is not restricted to the case of two neutral loci. The take-home message from the
construction of the Yule approximation is that splits in the beneficial background are
generated first and afterwards marks on a Yule tree determine all recombination events.
Both splits in the beneficial background and recombination events along the Yule tree
can be given along a continuous chromosome.

□ 



4 Application: D 

Lewontin's D is a measure of linkage disequilibrium (non-random association of alleles) and
is frequently used as a simple statistic in a multi-locus setting ([Lew64]; see also [Ewe04,
(2.89)]). Given two loci L and R with alleles 0 or 1 at each locus, it is defined as

$$D = p_{LR} - p_L\, p_R \tag{4.1}$$

where $p_{LR}$ is the frequency of individuals carrying allele 1 at both loci, $p_L$ is the frequency of
1's at the L locus and $p_R$ is the frequency of 1's at the R locus.
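
For a finite sample, (4.1) can be evaluated directly from two-locus haplotypes. The following is a small sketch; the function name and the encoding of haplotypes as pairs `(allele at L, allele at R)` with alleles 0 or 1 are illustrative assumptions.

```python
def lewontin_d(haplotypes):
    """Lewontin's D = p_LR - p_L * p_R for a list of (allele_L, allele_R) pairs in {0, 1}."""
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_lr = sum(1 for a_l, a_r in haplotypes if a_l == 1 and a_r == 1) / n
    p_l = sum(a_l for a_l, _ in haplotypes) / n
    p_r = sum(a_r for _, a_r in haplotypes) / n
    return p_lr - p_l * p_r


if __name__ == "__main__":
    # Perfect association of alleles: D = 0.5 - 0.5 * 0.5
    print(lewontin_d([(1, 1), (0, 0)]))
    # All four haplotypes equally frequent: no association
    print(lewontin_d([(1, 1), (1, 0), (0, 1), (0, 0)]))
```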

To predict patterns of $D$ between pairs of neutral loci at the time $T$ of fixation of a
beneficial allele we next approximate $E[D(T)]$ using Theorem 1. It is crucial to observe that
$E[p_{LR}(T)]$ as well as $E[p_L(T)p_R(T)]$ may be derived from the distribution of genealogies of
linked neutral loci under selection and the expected allele frequencies at the beginning of the
sweep. To see this, note that $E[p_{LR}(T)]$ equals the probability that the ancestors of the L-
and R-locus of one randomly picked individual from the population at time $T$ carry alleles
1 at both neutral loci. Analogously, $E[p_L(T)p_R(T)]$ is the probability that the ancestors of
the L- and R-loci of two different individuals at time $T$ both carry allele 1. Denote by $q$ the
probability that both loci, L and R from one individual, picked at time $T$, have a common
ancestor at the beginning of the sweep. Analogously, $q'$ is the same probability for the L- and
R-loci from two different individuals. Using these definitions we see that

$$\begin{aligned}
E[p_{LR}(T)] &= q \cdot E[p_{LR}(0)] + (1 - q) \cdot E[p_L(0)\,p_R(0)],\\
E[p_L(T)\,p_R(T)] &= q' \cdot E[p_{LR}(0)] + (1 - q') \cdot E[p_L(0)\,p_R(0)].
\end{aligned} \tag{4.2}$$

Combining (4.2) with the definition of $D$ from (4.1),

$$E[D(T)] = (q - q')\,E[D(0)]. \tag{4.3}$$
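
The step from the two mixture identities to (4.3) is a direct subtraction, which can be checked numerically. The sketch below is illustrative only; the function name and the numerical values of $q$, $q'$ and of the initial expectations are hypothetical.

```python
def expected_d_after_sweep(q, q_prime, e_plr0, e_plpr0):
    """E[D(T)] obtained by subtracting the two mixture identities for E[p_LR(T)] and E[p_L(T) p_R(T)].

    e_plr0  : E[p_LR(0)];  e_plpr0 : E[p_L(0) p_R(0)].
    """
    e_plr_T = q * e_plr0 + (1.0 - q) * e_plpr0
    e_plpr_T = q_prime * e_plr0 + (1.0 - q_prime) * e_plpr0
    return e_plr_T - e_plpr_T


if __name__ == "__main__":
    q, q_prime, e_plr0, e_plpr0 = 0.6, 0.2, 0.30, 0.25
    lhs = expected_d_after_sweep(q, q_prime, e_plr0, e_plpr0)
    rhs = (q - q_prime) * (e_plr0 - e_plpr0)  # (q - q') E[D(0)], as in (4.3)
    print(lhs, rhs)
```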

Both $q$ and $q'$ may be approximated by Theorem 1. Formally, setting $\boldsymbol{\ell} = \{1\}$, $\boldsymbol{r} = \{2\}$,

$$\begin{aligned}
q &= P\big[\Gamma^B_{\{\{1,2\}\}} \cup \Gamma^b_{\{\{1,2\}\}} = \{\{1, 2\}\}\big],\\
q' &= P\big[\Gamma^B_{\{\{1\},\{2\}\}} \cup \Gamma^b_{\{\{1\},\{2\}\}} = \{\{1, 2\}\}\big].
\end{aligned} \tag{4.4}$$

As $\Gamma_\pi$ may be approximated by $\Upsilon_\pi$ this brings us in a position to predict patterns of $D$ at
the end of a selective sweep.

Figure 4: The effect of a selective sweep on Lewontin's D may be simulated in a Wright-
Fisher model. In this process, the frequency path of the beneficial allele is stochastic and
the ancestral recombination graph may be built conditioned on this frequency path. The
locations of the L and R locus are fixed. The position of the selected site varies along the
x-axis. If we compare the result from (4.5) to equation (47) of [SSL06] we see that the Yule
process approximation is more accurate. The parameters of the Wright-Fisher model are
$N = 10^5$, $\alpha = 1000$, $\rho_{LR} = 20$ and $D(0) = 0.0242$.

Theorem 2. For geometry (i), 

E[D(r)]=^(2 7M )(l-^^^^(2 7Si ))E[i5(0)] + o( Ii ^), (4.5) 

and for geometry (ii),

$$E[D(T)] = E[D(0)] \cdot O\Big(\frac{1}{(\log\alpha)^2}\Big). \tag{4.6}$$

Remark 4.1. 1. Patterns of Lewontin's D can be studied by deterministic forward
calculations instead of our genealogical approach. This was carried out in [SSL06] under the
assumption that strong selection leads to a deterministic behaviour of allele frequencies.
Specifically, the frequency of the beneficial allele follows the logistic differential equation

$$dX = \alpha X(1 - X)\,dt,$$

started from a fixed positive initial frequency $X_0$, instead of the stochastic path given by (2.1). Predictions of $D$ at all times during the
selective sweep were given. In particular, their equation (47) approximates values of $D$
at the end of the sweep for geometry (i).

In real populations, random effects due to genetic drift are not negligible. This has been
pointed out by [LSP06]. The Yule process approximation captures most random effects.
Indeed, comparison with simulations from [LSP06] shows that the results produced by
the Yule process approximation are more accurate than those of [SSL06].

2. For empirical studies it is most interesting to know which patterns of linkage
disequilibrium to look for in real data. The pattern genetic hitchhiking can produce was discussed
in [SSL06] and [RT06]. Surprisingly, hitchhiking reduces levels of linkage disequilibrium
compared to the neutral expectation. This is evident from Figure 4. If the selected
locus is far from both neutral loci, linkage disequilibrium between the neutral loci is not
affected by hitchhiking. Therefore, values of $D$ for large $\rho_{SL}$ converge to the expectation
of $D$ under neutrality. This effect was taken up by [RT06] to argue that genetic
hitchhiking produces patterns in the association of alleles similar to recombination hotspots,
which are e.g. important in genetic association studies in humans ([Con05]). However,
genetic hitchhiking certainly produces patterns different from recombination hotspots in
general, e.g., a low neutral diversity or a distinctive site frequency spectrum ([FW00]).

3. An accurate approximation of E[D(T)] does not suffice to predict patterns of linkage 
disequilibrium in general. In addition to genetic drift, random effects which affect D(T) 
were found in [SSL06] to be the allelic type of the founder of the sweep and its frequency. 
The resulting variance in D can be considerably higher than under neutrality. 

Now we come to the proof of Theorem 2. 

Proof. The key in the proof is to compute the probabilities $q$ and $q'$. This is achieved by the
Yule process approximation of Theorem 1.

We start with geometry (ii). Here, we can see from the Yule approximation (3.7) that
$q = q'$ up to a term of order $1/(\log\alpha)^2$ since one L and one R locus are identical by descent
iff there is no LS mark on [tree diagram] and no SR mark on [tree diagram]. This event does not depend on the
linkage of the L and the R locus at the end of the sweep. Consequently, (4.6) follows.

For geometry (i), we start with the approximation of q'. For one L and one R locus from 
two different individuals there is a random number K of lines in the full tree of the Yule 
approximation at the time the selected loci which are linked to the neutral ones coalesce. To 
obtain the distribution of K, we compute 

which is a special case of [EPW06], (4.16). We read from (3.6) that the L and R locus are 
identical by descent at the beginning of the sweep if and only if (a) no mark or an SL mark 

\/2 \y2 l v 2 

falls on Y , (b) no mark hits and (c) no mark or an LR mark falls on \f 

Hence we compute 



2a 2 ( 1 

^j^pl(7LR)pl a (lSL)pl a (lLR) P l a (7SL) + O ^(7^)2 



(4.7)

For $q$ we have to distinguish the cases where the L- and the R-loci split or not. If they do not
split, the L- and R-locus have the same ancestor at the beginning of the sweep if and only if
there is neither an LR- nor an SLR-mark on the branch labelled $\{1,2\}$. If they split, the probability of a common
ancestor is $q'$. Therefore,

q = pl a (lLR)vl a (lLR) + (1 - vl a (lLR))q' + O ( q^s) ■ ( 4 -») 

Hence 

E[D(T)] = pI^lrM^lr) ~ q')nD(0)] + O (^55) ( 4 -9) 

and the result follows. □ 



5 Proof of Theorem 1 

The proof deals with geometries (i) and (ii) simultaneously. We will write events at rates
(3)-(8) whenever we refer to the rates (3i)-(8i) for geometry (i) and (3ii)-(8ii) for geometry
(ii), respectively.

We will be dealing with several random partitions, all of which agree up to an error of
order $O((\log\alpha)^{-2})$. More precisely, we will prove

$$\Gamma_\pi \;\overset{\text{Prop. 5.2}}{\sim}\; \Delta_\pi \;\overset{\text{Prop. 5.5}}{\sim}\; \Xi_\pi \;\overset{\text{Prop. 5.6}}{\sim}\; \Upsilon_\pi$$

where $\Gamma_\pi$, $\Delta_\pi$, $\Xi_\pi$ and $\Upsilon_\pi$ are given in Definitions 2.2, 5.1, 5.3 and 3.3, respectively, and '$\sim$'
means that the random partitions differ by $O((\log\alpha)^{-2})$ in variation distance.

While $\Gamma_\pi$ is the random partition which is defined by the structured ancestral recombination
graph, the other random partitions are approximations. First, $\Delta_\pi$ arises by (i) ignoring
events which occur according to rates (2), (6ii), (7) and (8) and (ii) realizing all events according
to rate (5) first and only afterwards constructing the process using rates (1), (3), (4) and
(6i). Second, $\Xi_\pi$ already deals with the Yule process. It is derived by marking an infinite
Yule tree by two constant-rate Poisson processes with rates $\rho_{SL}, \rho_{LR}$ for geometry (i) and
$\rho_{LS}, \rho_{SR}$ for geometry (ii). Finally, the Yule approximation $\Upsilon_\pi$ of $\Gamma_\pi$ arises by considering
only the number of lines in an infinite Yule tree at times of coalescence in a sample.

In the whole proof we rely on a probability measure P on a probability space on which the 
solution of (2.1) as well as arbitrarily many independent Poisson processes and other random 
variables are realized. 

Definition 5.1. Define a $\mathcal{P}_{\ell\cup r}$-valued random variable $\Delta_\pi$ as follows: starting in $\pi \in \mathcal{P}_{\ell\cup r}$,
split all partition elements $\xi \in \pi$ independently into $\xi\cap\ell$, $\xi\cap r$ with probability

$$1 - \mathbf{E}\Big[\exp\Big(-\rho\int_0^T X_s\,ds\Big)\Big] \qquad (5.1)$$

where $\rho = \rho_{LR}$ for geometry (i) and $\rho = \rho_{LS} + \rho_{SR}$ for geometry (ii). The resulting partition $\pi'$
is used for the starting point $(\pi', \emptyset)$ of a process $\eta^X = (\eta^X_\beta)_{0\le\beta\le T}$, conditioned on a frequency
path $X = (X_t)_{0\le t\le T}$, with transitions according to events (1), (3i), (6i), given by (2.3), for
geometry (i) and to events (1), (3ii) and (4ii), given by (2.4), for geometry (ii), respectively.
Given $\eta^X$, define



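Proposition 5.2. Let $\pi \in \mathcal{P}_{\ell\cup r}$ and $\Gamma_\pi$ and $\Delta_\pi$ be as in Definitions 2.2 and 5.1. Then,

$$\sup_{\xi}\big|\mathbf{P}[\Gamma_\pi = \xi] - \mathbf{P}[\Delta_\pi = \xi]\big| = O\Big(\frac{1}{(\log\alpha)^2}\Big).$$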

Proof. We proceed in several steps. Our arguments in Step 1 show that we may discard
events which occur at rates (2), (6ii), (7) and (8). In Step 2 we use a fixed number of Poisson
processes to generate the random partition we want to approximate. Our goal is to separate 
events (5) from the rest by verifying a certain order of the possible events and establishing 
an approximate independence of the events (5). Particularly, we show in Step 3 that splits 
in the beneficial background (i.e., events (5)) take place before all other events with high 
probability. The approximate independence will be proved in Steps 5 and 6 by an application 
of a general result on mixed Poisson processes we establish in Step 4. 

Step 1 (Small probability of events (2), (6ii), (7) and (8))

First, note that by Proposition 3.4 of [EPW06] events (2), i.e., coalescences in the wild-type
background, have a probability of order $O((\log\alpha)^{-2})$. Furthermore, events (7) and (8) are
back-recombinations into the beneficial background and hence have a probability of order
$O((\log\alpha)^{-2})$ as well. Additionally, for geometry (ii), events (6ii), i.e., splits in the wild-type
background, can only occur if a coalescence event (2) has happened before. As a consequence,
we can discard events which occur at rates (2), (6ii), (7) and (8), producing only an error in
variation distance of at most $O((\log\alpha)^{-2})$.

So we are left with a stochastic process on structured partitions of $\ell\cup r$, conditioned on $X$,
$\zeta^X = (\zeta^X_\beta)_{0\le\beta\le T}$, which arises by events (1), (3), (4), (5) and (6i), started in $\zeta_0 = (\pi, \emptyset)$.
Step 2 (Construction of $\zeta^X$ by Poisson processes)

Recall that $\ell := |\ell|$ and $r := |r|$ are the numbers of L and R loci under consideration. Take
Poisson processes which are all conditionally independent given the random frequency path
$X$ of the beneficial allele. For coalescence, take a Poisson process $\mathcal{T}^1$ with

$$\text{rate } \binom{\ell+r}{2}\frac{1}{X_{T-\beta}} \qquad \text{(coalescence in the beneficial background)} \qquad (5.2)$$



at time $\beta$; for recombination events take Poisson processes $\mathcal{T}^{3i}$, $\mathcal{T}^{4i}$, $\mathcal{T}^{5i}$ with

$$\begin{array}{lll}
\text{rate } \ell\rho_{SL}(1 - X_{T-\beta}) & \text{(rec. to the wild-type background)} & (3i) \\
\text{rate } r\rho_{LR}(1 - X_{T-\beta}) & \text{(rec. to or split in the wild-type background)} & (4i) \\
\text{rate } r\rho_{LR}X_{T-\beta} & \text{(split in the beneficial background)} & (5i)
\end{array} \qquad (5.3)$$

at time $\beta$ for geometry (i)

and Poisson processes $\mathcal{T}^{3ii}$, $\mathcal{T}^{4ii}$, $\mathcal{T}^{5ii}$ with

$$\begin{array}{lll}
\text{rate } \ell\rho_{LS}(1 - X_{T-\beta}) & \text{(rec. to the wild-type background)} & (3ii) \\
\text{rate } r\rho_{SR}(1 - X_{T-\beta}) & \text{(rec. to the wild-type background)} & (4ii) \\
\text{rate } r(\rho_{LS} + \rho_{SR})X_{T-\beta} & \text{(split in the beneficial background)} & (5ii)
\end{array} \qquad (5.4)$$

at time $\beta$ for geometry (ii). We have combined recombinations to the wild-type and splits in
the wild-type background in case of geometry (i) since they happen with the same rates.

Additionally, let $W = (W_{i,m})_{i=1,3,4,5,\ m=1,2,\dots}$ be a random array such that all $W_{i,m}$'s are
independent, $W_{1,m}$ is uniformly distributed on all pairs of $\ell\cup r$, $W_{3,m}$ is uniformly distributed
on $\ell$, and $W_{4,m}$ and $W_{5,m}$ are uniformly distributed on $r$, $m = 1, 2, \dots$

The set $\ell\cup r$ can be totally ordered, so we may assume that every partition element of
$\zeta$ has a smallest element. Recall that we write $\zeta_{(j)}$ for the partition element containing
$j \in \ell\cup r$.

We abbreviate by $\mathcal{T}^3$-$\mathcal{T}^5$ the Poisson processes $\mathcal{T}^{3i}$-$\mathcal{T}^{5i}$ for geometry (i) and the Poisson
processes $\mathcal{T}^{3ii}$-$\mathcal{T}^{5ii}$ for geometry (ii). We next show that the distribution of $\zeta^X_T$ is the image
measure of the tuple $(\mathcal{T}^1, \mathcal{T}^3, \mathcal{T}^4, \mathcal{T}^5, W)$ under a map $\varphi$. Specifically, the distribution of $\zeta^X_T$ is
uniquely determined by the distribution of $(\mathcal{T}^1, \mathcal{T}^3, \mathcal{T}^4, \mathcal{T}^5, W)$.

To define $\varphi$, consider a discrete set $\mathsf{T}^1 \subset [0,T]$ and finite sets $\mathsf{T}^3, \mathsf{T}^4, \mathsf{T}^5 \subset [0,T]$ such
that $\mathsf{T}^{i_1} \cap \mathsf{T}^{i_2} = \emptyset$ for $i_1 \ne i_2$, and set $\mathsf{T} = \bigcup_i \mathsf{T}^i$. Furthermore, let $w = (w_{i,m})_{i=1,3,4,5,\ m=1,2,\dots}$
be such that for all $m = 1, 2, \dots$, $w_{1,m}$ is a pair in $\ell\cup r$, $w_{3,m} \in \ell$ and $w_{4,m}, w_{5,m} \in r$. Given
$(\mathsf{T}^1, \mathsf{T}^3, \mathsf{T}^4, \mathsf{T}^5, w)$ we generate a partition by considering the events in $\mathsf{T}$ in decreasing order.
Assume that $\zeta_0 = (\pi, \emptyset)$, that after the $(m-1)$st event at time $\beta$ we have obtained a partition
$\zeta_\beta = (\zeta^B_\beta, \zeta^b_\beta)$, and that the $m$th event in $\mathsf{T}$ to be realized happens at time $\beta' \in \mathsf{T}$.

Consider first the case that the $m$th event is the $m_1$st event in $\mathsf{T}^1$, i.e., $\beta' \in \mathsf{T}^1$. The pair
$w_{1,m_1} = (j,k)$ gives a random pair of loci. If $\zeta_{(j)}, \zeta_{(k)} \in \zeta^B$ and if both $j$ and $k$ are the
smallest elements of their partition elements, coalesce these partition elements, i.e., make the
transition

$$(\zeta^B, \zeta^b) \to \big((\zeta^B \setminus \{\zeta_{(j)}, \zeta_{(k)}\}) \cup \{\zeta_{(j)} \cup \zeta_{(k)}\},\ \zeta^b\big).$$

Otherwise do nothing.

The next case to consider is that $\beta'$ is the $m_3$rd event in $\mathsf{T}^3$ and $w_{3,m_3} = j$ for some $j \in \ell$.
If $\zeta_{(j)} \in \zeta^B$ and if $j$ is the smallest element of $\zeta_{(j)} \cap \ell$, move the partition element from $\zeta^B$
to $\zeta^b$, i.e., make the transition

$$(\zeta^B, \zeta^b) \to \big(\zeta^B \setminus \{\zeta_{(j)}\},\ \zeta^b \cup \{\zeta_{(j)}\}\big). \qquad (5.5)$$

Otherwise do nothing. The case $\beta' \in \mathsf{T}^5$ is similar and is omitted.

If $\beta'$ is the $m_4$th event in $\mathsf{T}^4$ and $w_{4,m_4} = j$ for $j \in r$, the partition $\zeta$ again only changes
if $j = \min(\zeta_{(j)} \cap r)$. We distinguish two cases, $\zeta_{(j)} \in \zeta^B$ and $\zeta_{(j)} \in \zeta^b$. In the former case, split
the L- and R-loci in the partition element into two partition elements and bring all R-loci into
the wild-type background, i.e., make the transition

$$(\zeta^B, \zeta^b) \to \big((\zeta^B \setminus \{\zeta_{(j)}\}) \cup \{\zeta_{(j)} \cap \ell\},\ \zeta^b \cup \{\zeta_{(j)} \cap r\}\big). \qquad (5.6)$$

This corresponds to an event (4). In the latter case split all L- and R-loci of $\zeta_{(j)}$ and leave
them in the wild-type background, i.e., make the transition

$$(\zeta^B, \zeta^b) \to \big(\zeta^B,\ (\zeta^b \setminus \{\zeta_{(j)}\}) \cup \{\zeta_{(j)} \cap \ell,\ \zeta_{(j)} \cap r\}\big), \qquad (5.7)$$

which corresponds to an event (6i). Recall that for geometry (ii) one L- and one R-locus
cannot recombine to the wild-type background together. Hence partition elements in $\zeta^b$ are
either subsets of $\ell$ or of $r$, so that the last transition cannot occur for this geometry.

Figure 5: (a) A partition element (a line) is hit by an event taking both the L- and the
R-locus to the wild-type background at time $\beta'$. Afterwards, at time $\beta''$, the line is split in the
wild-type background. (b) Here, the R-locus is taken to the wild-type background at time
$\beta'$. Afterwards the L-locus is taken to the same background at time $\beta''$. The outcome is the
same: the line moves from the beneficial to the wild-type background and is split there.

By generating all events according to this procedure we end with a partition $\zeta_T$. Therefore
we have defined the map $\varphi : (\mathsf{T}^1, \mathsf{T}^3, \mathsf{T}^4, \mathsf{T}^5, w) \mapsto \zeta_T$.

The distribution of $\zeta^X_T$ is the image measure of $(\mathcal{T}^1, \mathcal{T}^3, \mathcal{T}^4, \mathcal{T}^5, W)$ under
the map $\varphi$. (5.8)

To see this, observe first that there are only finitely many recombination events (3), (4), (5)
and (6i). Almost surely, all events in the Poisson processes occur at different times, so $\varphi$ is
defined on a set of probability 1. By the above construction, we obtain that two partition
elements in $\zeta^B$ coalesce by event (1). The Poisson processes $\mathcal{T}^3, \mathcal{T}^4, \mathcal{T}^5$ produce exactly
the recombination events (3), (4), (5) and (6i). Hence (5.8) is proved.

Given $w$, the random partition $\varphi(\mathsf{T}^1, \mathsf{T}^3, \mathsf{T}^4, \mathsf{T}^5, w)$ only depends on the order of time
points in $\mathsf{T}^1, \mathsf{T}^3, \mathsf{T}^4, \mathsf{T}^5$. There is another feature we will need:

Let $\beta', \beta''$ be consecutive time points in $\mathsf{T}$ with $\beta' \in \mathsf{T}^3$, $\beta'' \in \mathsf{T}^4$.
Exchanging $\beta'$ and $\beta''$ does not alter the random partition
$\varphi(\mathsf{T}^1, \mathsf{T}^3, \mathsf{T}^4, \mathsf{T}^5, w)$. Formally, if $\mathsf{T} \cap (\beta', \beta'') = \emptyset$, $\mathsf{T}'^3 = (\mathsf{T}^3 \setminus \{\beta'\}) \cup \{\beta''\}$
and $\mathsf{T}'^4 = (\mathsf{T}^4 \setminus \{\beta''\}) \cup \{\beta'\}$, then

$$\varphi(\mathsf{T}^1, \mathsf{T}'^3, \mathsf{T}'^4, \mathsf{T}^5, w) = \varphi(\mathsf{T}^1, \mathsf{T}^3, \mathsf{T}^4, \mathsf{T}^5, w). \qquad (5.9)$$

Assume $\beta'$ is the $m_3$rd event in $\mathsf{T}^3$, $w_{3,m_3} = j$, and $\beta''$ is the $m_4$th event in $\mathsf{T}^4$ and $w_{4,m_4} = k$.
If $j$ and $k$ are not in the same partition element for $\beta < \beta'$, the claim is trivial as recombination
events only make the partition finer. Similarly, if $j > \min(\zeta_{(j)} \cap \ell)$ or $k > \min(\zeta_{(k)} \cap r)$ only one
transition occurs and the claim follows. In the case

$$\zeta_{(j)} = \zeta_{(k)}, \qquad j = \min(\zeta_{(j)} \cap \ell), \qquad k = \min(\zeta_{(k)} \cap r)$$

two transitions occur if and only if $\zeta_{(j)} = \zeta_{(k)} \in \zeta^B$. We illustrate this situation in Figure 5.

Observe that the two-step transitions for the pair ((5.5), (5.7)) (see Figure 5(a)) as well
as for the pair ((5.6), (5.5)) (see Figure 5(b)) are given by

$$(\zeta^B, \zeta^b) \to \big(\zeta^B \setminus \{\zeta_{(j)}\},\ \zeta^b \cup \{\zeta_{(j)} \cap \ell,\ \zeta_{(j)} \cap r\}\big),$$

i.e., the partition element both moves from $\zeta^B$ to $\zeta^b$ and is split into its L- and R-loci. This
proves (5.9).
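The exchange property (5.9) can be checked mechanically on a small example. The following Python sketch is not part of the original construction. It encodes transitions (5.5), (5.6) and (5.7) under the assumption that loci are represented by hypothetical tuples such as `("L", 1)` and `("R", 1)`, and that a state is a pair of frozensets (beneficial background, wild-type background). The names `event3`, `event4`, `side` and `block_of` are illustrative only.

```python
def side(block, kind):
    """Loci of the given kind ("L" or "R") inside a partition element."""
    return frozenset(x for x in block if x[0] == kind)


def block_of(state, locus):
    """Return (background index, partition element) containing the locus.

    Background index 0 is the beneficial background, 1 the wild-type one.
    """
    for background, blocks in enumerate(state):
        for block in blocks:
            if locus in block:
                return background, block
    raise KeyError(locus)


def event3(state, j):
    """Transition (5.5): move the element of L-locus j to the wild-type background."""
    B, b = state
    where, block = block_of(state, j)
    if where == 0 and j == min(side(block, "L")):
        return (B - {block}, b | {block})
    return state  # otherwise do nothing


def event4(state, k):
    """Transitions (5.6) and (5.7), triggered by R-locus k."""
    B, b = state
    where, block = block_of(state, k)
    if k != min(side(block, "R")):
        return state  # only the smallest R-locus triggers a change
    left, right = side(block, "L"), side(block, "R")
    if where == 0:
        # (5.6): L-loci stay in the beneficial background, R-loci move out
        new_B = (B - {block}) | ({left} if left else set())
        return (frozenset(new_B), b | {right})
    # (5.7): split inside the wild-type background
    pieces = frozenset(p for p in (left, right) if p)
    return (B, (b - {block}) | pieces)


if __name__ == "__main__":
    start = (frozenset({frozenset({("L", 1), ("R", 1)})}), frozenset())
    first_3_then_4 = event4(event3(start, ("L", 1)), ("R", 1))
    first_4_then_3 = event3(event4(start, ("R", 1)), ("L", 1))
    print(first_3_then_4 == first_4_then_3)
```

Both orders end with an empty beneficial background and the two singletons in the wild-type background, which is the two-step transition displayed above.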

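Step 3 (Probable order of events)

Define $\varepsilon := \frac{(\log\alpha)^2}{\alpha}$ and $T_\varepsilon := \min\{t \ge 0 : X_t = \varepsilon\}$. We will show that, with high
probability, (i) no coalescences, i.e., events (1), occur in $[T_\varepsilon, T]$, (ii) no splits in the beneficial
background, i.e., events (5), occur during $[0, T_\varepsilon]$ and (iii) splits in the beneficial background,
i.e., events (5), do not overlap with other recombination events (3), (4). More precisely, we claim

$$\mathbf{P}\big[\mathcal{T}^1 \cap [T_\varepsilon, T] \ne \emptyset\big] = O\Big(\frac{1}{(\log\alpha)^2}\Big), \qquad (5.10)$$

$$\mathbf{P}\big[\mathcal{T}^5 \cap [0, T_\varepsilon] \ne \emptyset\big] = O\Big(\frac{(\log\alpha)^2}{\alpha}\Big), \qquad (5.11)$$

$$\mathbf{P}\big[\min \mathcal{T}^5 < \max(\mathcal{T}^3 \cup \mathcal{T}^4)\big] = O\Big(\frac{1}{(\log\alpha)^2}\Big). \qquad (5.12)$$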

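First, (5.10) coincides with the assertion of Lemma 4.3 in [EPW06]. Second, for (5.11)
we have $X_t \le \frac{(\log\alpha)^2}{\alpha}$ for all $t \le T_\varepsilon$. Hence we get

$$\mathbf{P}\big[\mathcal{T}^5 \cap [0, T_\varepsilon] = \emptyset\big] = \mathbf{E}\Big[\exp\Big(-r\rho_{LR}\int_0^{T_\varepsilon} X_s\,ds\Big)\Big] \ge \mathbf{E}\big[\exp(-r\rho_{LR}\,\varepsilon\,T_\varepsilon)\big] \ge \exp\big(-r\rho_{LR}\,\varepsilon\,\mathbf{E}[T]\big).$$

By (3.1) we see that $\mathbf{E}[T] = \frac{2\log\alpha}{\alpha} + O\big(\frac{1}{\alpha}\big)$. By the choice of $\varepsilon$, this finally gives

$$\mathbf{P}\big[\mathcal{T}^5 \cap [0, T_\varepsilon] = \emptyset\big] \ge 1 - O\Big(\frac{(\log\alpha)^2}{\alpha}\Big).$$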



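Third, for (5.12) we write, using $\rho = O\big(\frac{\alpha}{\log\alpha}\big)$, which might change from occurrence to
occurrence,

$$\begin{aligned}
\mathbf{P}\big[\min \mathcal{T}^5 < \max(\mathcal{T}^3 \cup \mathcal{T}^4)\big]
&= \mathbf{E}\Big[\int_0^T \mathbf{P}\big[\mathcal{T}^5 \cap [0,t] \ne \emptyset \,\big|\, \max(\mathcal{T}^3 \cup \mathcal{T}^4) \in dt,\ X\big] \cdot \mathbf{P}\big[\max(\mathcal{T}^3 \cup \mathcal{T}^4) \in dt \,\big|\, X\big]\Big] \\
&\le \mathbf{E}\Big[\int_0^T \Big(1 - \exp\Big(-\int_0^t \rho X_s\,ds\Big)\Big)\,\rho(1 - X_t)\,\exp\Big(-\int_t^T \rho(1 - X_s)\,ds\Big)\,dt\Big] \\
&\le \rho^2 \cdot \mathbf{E}\Big[\int_0^T (1 - X_t)\int_0^t X_s\,ds\,dt\Big]. \qquad (5.13)
\end{aligned}$$

The last term can be estimated using the Green function for the diffusion (2.1). As the right
hand side of (5.13) coincides with the second line of (4.5) in [EPW06] we immediately obtain
(5.12).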

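In the next three steps we will show that realizing the different splits independently from
a fixed sample path $X = (X_t)_{0\le t\le T}$ will cause only a small error. To see this we will establish
a general result on mixed Poisson processes in Step 4 and apply it to the Poisson processes
introduced in Step 2. The proof of Proposition 5.2 will then be concluded by an application
of these two steps.

Step 4 (General approximations of mixed Poisson processes)

Let $\{\Psi(\delta) : \delta > 0\}$, $\{\Phi(\delta) : \delta > 0\}$ be families of random variables taking values in $\mathbb{R}_+$.
Assume that the expectations $\mathbf{E}[\Psi(\delta)]$, $\mathbf{E}[\Phi(\delta)]$ are bounded in $\delta$ and

$$\mathbf{V}[\Psi(\delta)],\ \mathbf{V}[\Phi(\delta)] = O(\delta) \qquad (5.14)$$

as $\delta \to 0$. Denote the weights of the Poisson distribution with parameter $\lambda$ by
$\mathrm{Poi}_\lambda(\cdot)$. We claim that for $k, l \in \mathbb{N}_0$

$$\mathbf{E}\big[\mathrm{Poi}_{\Psi(\delta)}(k)\big] = \mathrm{Poi}_{\mathbf{E}[\Psi(\delta)]}(k) + O(\delta), \qquad (5.15)$$

$$\mathbf{E}\big[\mathrm{Poi}_{\Psi(\delta)}(k) \cdot \mathrm{Poi}_{\Phi(\delta)}(l)\big] = \mathbf{E}\big[\mathrm{Poi}_{\Psi(\delta)}(k)\big] \cdot \mathbf{E}\big[\mathrm{Poi}_{\Phi(\delta)}(l)\big] + O(\delta). \qquad (5.16)$$

Note that by a Taylor series approximation, for a random variable $\Psi$ with second
moments and some $\Psi'$ satisfying $|\Psi' - \mathbf{E}[\Psi]| \le |\Psi - \mathbf{E}[\Psi]|$,

$$\begin{aligned}
\Big|\mathbf{E}\Big[e^{-\Psi}\frac{\Psi^k}{k!}\Big] - e^{-\mathbf{E}[\Psi]}\frac{\mathbf{E}[\Psi]^k}{k!}\Big|
&= \frac{1}{2}\Big|\mathbf{E}\Big[\frac{d^2}{d\psi^2}\Big(e^{-\psi}\frac{\psi^k}{k!}\Big)\Big|_{\psi=\Psi'} \cdot (\Psi - \mathbf{E}[\Psi])^2\Big]\Big| \\
&= \frac{1}{2}\Big|\mathbf{E}\Big[e^{-\Psi'}\Big\{\frac{\Psi'^{\,k-2}}{(k-2)!} - 2\frac{\Psi'^{\,k-1}}{(k-1)!} + \frac{\Psi'^{\,k}}{k!}\Big\}(\Psi - \mathbf{E}[\Psi])^2\Big]\Big| \\
&\le 2\,\mathbf{V}[\Psi] \qquad (5.17)
\end{aligned}$$

where the terms in $\{\dots\}$ only show up if the denominators are non-zero and the last step
follows from the fact that the Poisson weights in $\{\dots\}$ lie in $[0,1]$. As this holds for every
$\Psi(\delta)$, (5.15) follows immediately from (5.14). Moreover, by a calculation similar to (5.17),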



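$$\mathbf{V}\big[\mathrm{Poi}_{\Psi(\delta)}(k)\big] = \mathbf{E}\Big[e^{-2\Psi(\delta)}\frac{\Psi(\delta)^{2k}}{(k!)^2}\Big] - \Big(\mathbf{E}\Big[e^{-\Psi(\delta)}\frac{\Psi(\delta)^k}{k!}\Big]\Big)^2 = O\big(\mathbf{V}[\Psi(\delta)]\big) = O(\delta).$$

Additionally, (5.16) follows easily from the fact that

$$\begin{aligned}
\big|\mathbf{E}\big[\mathrm{Poi}_{\Psi(\delta)}(k) \cdot \mathrm{Poi}_{\Phi(\delta)}(l)\big] - \mathbf{E}\big[\mathrm{Poi}_{\Psi(\delta)}(k)\big] \cdot \mathbf{E}\big[\mathrm{Poi}_{\Phi(\delta)}(l)\big]\big|
&= \big|\mathbf{Cov}\big[\mathrm{Poi}_{\Psi(\delta)}(k),\ \mathrm{Poi}_{\Phi(\delta)}(l)\big]\big| \\
&\le \sqrt{\mathbf{V}\big[\mathrm{Poi}_{\Psi(\delta)}(k)\big] \cdot \mathbf{V}\big[\mathrm{Poi}_{\Phi(\delta)}(l)\big]} = O(\delta)
\end{aligned}$$

by the Cauchy-Schwarz inequality.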
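As a numerical illustration of (5.15) and the bound (5.17), the following Python sketch compares a mixed Poisson weight with the Poisson weight at the mean. The two-point mixing distribution on $1 \pm d$ is a made-up example, and the function names are illustrative. The sketch only shows that the gap shrinks with the variance of the mixing variable and is not a substitute for the argument above.

```python
import math


def poisson_weight(lam, k):
    """Poisson point probability Poi_lam(k) = exp(-lam) * lam**k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def mixing_gap(values, probs, k):
    """Return (|E[Poi_Psi(k)] - Poi_{E[Psi]}(k)|, V[Psi]) for a discrete Psi.

    `values` and `probs` describe the (hypothetical) distribution of Psi.
    """
    mean = sum(p * v for v, p in zip(values, probs))
    var = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    mixed = sum(p * poisson_weight(v, k) for v, p in zip(values, probs))
    return abs(mixed - poisson_weight(mean, k)), var


if __name__ == "__main__":
    for d in (0.3, 0.1, 0.03):
        gap, var = mixing_gap((1 - d, 1 + d), (0.5, 0.5), k=2)
        print(f"V[Psi] = {var:.4f}   gap = {gap:.6f}   bound 2V = {2 * var:.4f}")
```

In every row the gap stays below $2\,\mathbf{V}[\Psi]$, in line with (5.17).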
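Step 5 (Green function estimates)

Set $\rho = \gamma\frac{\alpha}{\log\alpha}$ where $\gamma = \gamma_{LR}$ for geometry (i) and $\gamma = \gamma_{LS} + \gamma_{SR}$ for geometry (ii). Using
our approximations from Step 4 we will show next

$$\mathbf{P}\big[|\mathcal{T}^5| = k\big] = \mathrm{Poi}_{\mathbf{E}[r\rho\int_0^T X_s\,ds]}(k) + O\Big(\frac{1}{(\log\alpha)^2}\Big), \qquad (5.18)$$

$$\mathbf{P}\big[|(\mathcal{T}^3 \cup \mathcal{T}^4) \cap [T_\varepsilon, T]| = k,\ |\mathcal{T}^5| = l\big] = \mathbf{P}\big[|(\mathcal{T}^3 \cup \mathcal{T}^4) \cap [T_\varepsilon, T]| = k\big] \cdot \mathbf{P}\big[|\mathcal{T}^5| = l\big] + O\Big(\frac{1}{(\log\alpha)^2}\Big) \qquad (5.19)$$

as $\alpha \to \infty$. To see this, set $\delta = \frac{1}{(\log\alpha)^2}$ and define

$$\Psi(\delta) = r\rho\int_0^T X_s\,ds, \qquad \Phi(\delta) = (\ell + r)\rho\int_{T_\varepsilon}^T (1 - X_s)\,ds.$$

Observe that for $k = 0, 1, 2, \dots$

$$\mathbf{P}\big[|\mathcal{T}^5| = k\big] = \mathbf{E}\big[\mathrm{Poi}_{\Psi(\delta)}(k)\big], \qquad
\mathbf{P}\big[|(\mathcal{T}^3 \cup \mathcal{T}^4) \cap [T_\varepsilon, T]| = k\big] = \mathbf{E}\big[\mathrm{Poi}_{\Phi(\delta)}(k)\big] \qquad (5.20)$$

because $\mathcal{T}^3, \mathcal{T}^4, \mathcal{T}^5$ are randomly time-changed Poisson processes. By (5.15) and (5.16), (5.18)
and (5.19) follow once we have shown

$$\mathbf{E}\Big[\rho\int_{T_\varepsilon}^T (1 - X_s)\,ds\Big] \le \mathbf{E}\Big[\rho\int_0^T X_s\,ds\Big] \le 2\gamma + O\Big(\frac{1}{\log\alpha}\Big), \qquad (5.21)$$

$$\mathbf{V}\Big[\rho\int_{T_\varepsilon}^T (1 - X_s)\,ds\Big] \le \mathbf{V}\Big[\rho\int_0^T X_s\,ds\Big] = O\Big(\frac{1}{(\log\alpha)^2}\Big) \qquad (5.22)$$

as $\alpha \to \infty$.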

First observe that $(X_t)_{0\le t\le T}$ has the same distribution as $(1 - X_{T-t})_{0\le t\le T}$ by time-reversibility
(see e.g. [KT81, Gri03]). Hence the inequalities on the left hand side of (5.21) and (5.22)
follow. Second, we verify the expressions on the right hand side of (5.21) and (5.22) by an
application of the Green function $G(\cdot,\cdot)$ of the diffusion $(X_t)_{0\le t\le T}$. This function satisfies

$$\mathbf{E}_x\Big[\int_0^T g(X_t)\,dt\Big] = \int_0^1 G(x,y)\,g(y)\,dy$$

where $\mathbf{E}_x[\cdot]$ refers to the path $(X_t)_{0\le t\le T}$ with $X_0 = x$ and $\mathbf{E}[\cdot] := \mathbf{E}_0[\cdot]$. The Green function
is given by

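$$G(x,y) = \begin{cases}
\dfrac{(1 - e^{-\alpha(1-y)})(1 - e^{-\alpha y})}{\alpha y(1-y)(1 - e^{-\alpha})} & \text{if } x \le y, \\[2ex]
\dfrac{(e^{-\alpha x} - e^{-\alpha})(e^{\alpha y} - 1)(1 - e^{-\alpha y})}{\alpha y(1-y)(1 - e^{-\alpha})(1 - e^{-\alpha x})} & \text{if } x \ge y,
\end{cases}$$

see e.g. [KT81, EPW06]. More generally, $G$ satisfies

$$\mathbf{E}_x\Big[\int_0^T\!\!\int_{t_1}^T \cdots \int_{t_{k-1}}^T g_k(X_{t_k}) \cdots g_1(X_{t_1})\,dt_k \dots dt_1\Big]
= \int_0^1 \cdots \int_0^1 G(x,x_1) \cdots G(x_{k-1},x_k)\,g_1(x_1) \cdots g_k(x_k)\,dx_k \dots dx_1$$

for all $k = 1, 2, \dots$, which can be proved by induction. We may thus write, because $G(x,y) = G(0,y)$
for $y \ge x$,

$$\begin{aligned}
\mathbf{V}\Big[\rho\int_0^T X_s\,ds\Big]
&= \rho^2\Big(2\int_0^1\!\!\int_0^1 G(0,x)G(x,y)\,xy\,dy\,dx - 2\int_0^1\!\!\int_x^1 G(0,x)G(0,y)\,xy\,dy\,dx\Big) \\
&= 2\rho^2\int_0^1\!\!\int_0^x G(0,x)G(x,y)\,xy\,dy\,dx \\
&\le 2\rho^2\int_0^1\!\!\int_0^x G(0,x)G(x,y)\,dy\,dx \\
&\le 2\rho^2\,\mathbf{V}[T] = O\Big(\frac{1}{(\log\alpha)^2}\Big)
\end{aligned}$$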

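by (3.1) which gives (5.22).

Step 6 (Approximate independence)

As we have seen in (5.8), the distribution of $\zeta^X_T$ is determined by the distribution of the order
of events in the Poisson processes $\mathcal{T}^1$, $\mathcal{T}^3$, $\mathcal{T}^4$ and $\mathcal{T}^5$. The calculations in Step 3 allow us to
make the assumptions

$$\mathcal{T}^1 \cap [T_\varepsilon, T] = \emptyset, \qquad \mathcal{T}^5 \cap [0, T_\varepsilon] = \emptyset, \qquad \max(\mathcal{T}^3 \cup \mathcal{T}^4) < \min \mathcal{T}^5$$

on the ordering of events in these Poisson processes, as these events have probability
$1 - O((\log\alpha)^{-2})$. Furthermore, we know from (5.9) that events in $\mathcal{T}^3$ and $\mathcal{T}^4$ may be exchanged
without changing the distribution of $\zeta^X_T$. Hence, the distribution of $\zeta^X_T$ is determined once
the joint distribution of

$$\mathcal{T}^1 \cap [0, T_\varepsilon], \quad \mathcal{T}^3 \cap [0, T_\varepsilon], \quad \mathcal{T}^4 \cap [0, T_\varepsilon], \quad |(\mathcal{T}^3 \cup \mathcal{T}^4) \cap [T_\varepsilon, T]|, \quad |\mathcal{T}^5|$$

is known. To approximate the joint distribution of these objects, define

$$\mathcal{T}^i_\varepsilon := \mathcal{T}^i \cap [0, T_\varepsilon],\ i = 1, 3, 4, \qquad K_{3,4} := |(\mathcal{T}^3 \cup \mathcal{T}^4) \cap [T_\varepsilon, T]|, \qquad K_5 := |\mathcal{T}^5|.$$

We will prove

$$\mathbf{P} \circ (\mathcal{T}^1_\varepsilon, \mathcal{T}^3_\varepsilon, \mathcal{T}^4_\varepsilon, K_{3,4}, K_5)^{-1} = \mathbf{P} \circ (\mathcal{T}^1_\varepsilon, \mathcal{T}^3_\varepsilon, \mathcal{T}^4_\varepsilon, K_{3,4})^{-1} \otimes \mathrm{Poi}_{\mathbf{E}[r\rho\int_0^T X_s\,ds]} + O\Big(\frac{1}{(\log\alpha)^2}\Big) \qquad (5.23)$$

where $\mathbf{P} \circ X^{-1}$ is the image measure of the random variable $X$ under $\mathbf{P}$ and the Landau symbol
in this context gives the order in variation distance of the distributions.

Once (5.23) is shown we conclude that $K_5$ is approximately independent of all other events.
Furthermore, its distribution may be interpreted as the sum of $r$ Poisson distributions with
parameter $\mathbf{E}\big[\rho\int_0^T X_s\,ds\big]$. These determine the number of split events on all partition elements
$\xi \in \pi$ with $\xi \cap r \ne \emptyset$. A partition element splits if it is hit by at least one split event. The
probability for a split of a partition element is thus given, using (5.18) and (5.20) for $k = 0$,
by

$$1 - \exp\Big(-\rho \cdot \mathbf{E}\Big[\int_0^T X_s\,ds\Big]\Big) = 1 - \mathbf{E}\Big[\exp\Big(-\rho\int_0^T X_s\,ds\Big)\Big] + O\Big(\frac{1}{(\log\alpha)^2}\Big)$$

with $\rho = \rho_{LR}$ for geometry (i) and $\rho = \rho_{LS} + \rho_{SR}$ for geometry (ii). Observe that $\Gamma_\pi$ is
determined by the distribution of $(\mathcal{T}^1_\varepsilon, \mathcal{T}^3_\varepsilon, \mathcal{T}^4_\varepsilon, K_{3,4})$ if $K_5$ is known. The random partition $\Delta_\pi$
is determined by the distribution of $(\mathcal{T}^1_\varepsilon, \mathcal{T}^3_\varepsilon, \mathcal{T}^4_\varepsilon, K_{3,4})$ independently of $K_5$. So, Proposition
5.2 is a consequence of the approximate independence of $(\mathcal{T}^1_\varepsilon, \mathcal{T}^3_\varepsilon, \mathcal{T}^4_\varepsilon, K_{3,4})$ and $K_5$ given by
(5.23).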

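We write

$$\begin{aligned}
\mathbf{P} \circ (\mathcal{T}^1_\varepsilon, \mathcal{T}^3_\varepsilon, \mathcal{T}^4_\varepsilon, K_{3,4}, K_5)^{-1}
&= \int \mathbf{P}_X \circ (\mathcal{T}^1_\varepsilon, \mathcal{T}^3_\varepsilon, \mathcal{T}^4_\varepsilon, K_{3,4}, K_5)^{-1}\ \mathbf{P}[dX] \\
&= \int \mathbf{P}_{(X_t)_{0\le t\le T_\varepsilon}} \circ (\mathcal{T}^1_\varepsilon, \mathcal{T}^3_\varepsilon, \mathcal{T}^4_\varepsilon)^{-1}\ \mathbf{P}\big[d(X_t)_{0\le t\le T_\varepsilon}\big] \\
&\qquad \otimes \int \mathbf{P}_{(X_t)_{T_\varepsilon\le t\le T}} \circ (K_{3,4}, K_5)^{-1}\ \mathbf{P}\big[d(X_t)_{T_\varepsilon\le t\le T}\big] + O\Big(\frac{(\log\alpha)^2}{\alpha}\Big)
\end{aligned}$$

where we have used the fact that $T_\varepsilon$ is a stopping time and the strong Markov property of
the process $X$. Note that by (5.11) we may assume $K_5 = |\mathcal{T}^5 \cap [T_\varepsilon, T]|$, which gives an error
of $O\big(\frac{(\log\alpha)^2}{\alpha}\big)$ in probability. From Steps 4 and 5 we get

$$\int \mathbf{P}_{(X_t)_{T_\varepsilon\le t\le T}} \circ (K_{3,4}, K_5)^{-1}\ \mathbf{P}\big[d(X_t)_{T_\varepsilon\le t\le T}\big]
= \mathrm{Poi}_{\mathbf{E}[(\ell+r)\rho\int_{T_\varepsilon}^T (1 - X_s)\,ds]} \otimes \mathrm{Poi}_{\mathbf{E}[r\rho\int_0^T X_s\,ds]} + O\Big(\frac{1}{(\log\alpha)^2}\Big).$$

Rewriting

$$\mathrm{Poi}_{\mathbf{E}[(\ell+r)\rho\int_{T_\varepsilon}^T (1 - X_s)\,ds]} = \int \mathbf{P}_{(X_t)_{T_\varepsilon\le t\le T}} \circ (K_{3,4})^{-1}\ \mathbf{P}\big[d(X_t)_{T_\varepsilon\le t\le T}\big]$$

and using the strong Markov property of $X$ a second time we get

$$\begin{aligned}
\mathbf{P} \circ (\mathcal{T}^1_\varepsilon, \mathcal{T}^3_\varepsilon, \mathcal{T}^4_\varepsilon, K_{3,4}, K_5)^{-1}
&= \int \mathbf{P}_{(X_t)_{0\le t\le T_\varepsilon}} \circ (\mathcal{T}^1_\varepsilon, \mathcal{T}^3_\varepsilon, \mathcal{T}^4_\varepsilon)^{-1}\ \mathbf{P}\big[d(X_t)_{0\le t\le T_\varepsilon}\big] \\
&\qquad \otimes \int \mathbf{P}_{(X_t)_{T_\varepsilon\le t\le T}} \circ (K_{3,4})^{-1}\ \mathbf{P}\big[d(X_t)_{T_\varepsilon\le t\le T}\big] \otimes \mathrm{Poi}_{\mathbf{E}[r\rho\int_0^T X_s\,ds]} + O\Big(\frac{1}{(\log\alpha)^2}\Big)
\end{aligned}$$

and we are done. □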



By Proposition 5.2, events (5) can be generated independently of the frequency path and
of all other events. The rates of the recombination events (3), (4), (6i) at time $\beta$ are all
proportional to $(1 - X_{T-\beta})$. This is reminiscent of the case of only one neutral locus, studied
in [EPW06], where a line carrying one neutral locus in recombination distance $\rho$ recombines
to the wild-type background with rate $\rho(1 - X_{T-\beta})$. As a consequence we can use the same
techniques used there, especially their Proposition 3.6, which states that a marked Yule tree
approximately gives the same partition as the structured coalescent.

Definition 5.3. Define a $\mathcal{P}_{\ell\cup r}$-valued random variable $\Xi_\pi$ as follows: For all partition
elements $\xi \in \pi$ with $\xi \cap \ell \ne \emptyset$, $\xi \cap r \ne \emptyset$, i.e., $\xi$ carries both left and right loci, split the
partition element into its left and right loci, $\xi \cap \ell$, $\xi \cap r$, according to (5.1). Denote the resulting
partition by $\pi'$.

Let $\mathcal{Y}$ be an infinite Yule tree with branching rate $\alpha$. Moreover, consider the random tree
$\mathcal{Y}_{|\pi'|}$ which arises by sampling $|\pi'|$ lines from $\mathcal{Y}$ at infinity. Identify each of the $|\pi'|$ partition
elements of $\pi'$ with one sampled line. Between the root of the Yule tree $\mathcal{Y}$ and the time
it has $\lfloor 2\alpha \rfloor$ lines, mark all lines by the following procedure:

For geometry (i), the tree is marked by Poisson processes with rates $\rho_{SL}$ and $\rho_{LR}$. These
marks are relabelled such that each branch is hit by at most one mark. Call the corresponding
marks SL-, LR- and SLR-marks. The following rules are applied:

Figure 6: There are two possibilities how an SLR-mark may occur. Here, SL and LR refer
to points in the Poisson processes with rates $\rho_{SL}$ and $\rho_{LR}$. See text for further explanation.



(a) If the Poisson process with rate $\rho_{SL}$ puts the first (backward in time) mark at time $t$ from
the root, start a Poisson process with rate $\rho_{LR}$ and run it for time $t$. If an event occurs
during this time, the branch is marked by an SLR-mark, otherwise by an SL-mark.

(b) If the Poisson process with rate $\rho_{LR}$ puts the first (backward in time) mark, distinguish
the following two cases: if the Poisson process with rate $\rho_{SL}$ hits the branch as well, it
obtains an SLR-mark. Otherwise, it obtains an LR-mark.

For geometry (ii), mark the tree by two independent Poisson processes with rates $\rho_{LS}$ and
$\rho_{SR}$. If a branch is hit by one or more events of the Poisson process with rate $\rho_{LS}$, it gets an
LS-mark. If it is hit by one or more events with rate $\rho_{SR}$, it additionally gets an SR-mark.

The result of this procedure is a marked Yule tree $\mathcal{Y}_{|\pi'|}$. Given $\pi'$ and the marked Yule tree
$\mathcal{Y}_{|\pi'|}$ we use the same equivalence relation as given in (3.6) and (3.7) to define $\pi'' \in \mathcal{P}_{\ell\cup r}$.
Furthermore, we use (3.8) and (3.9) to define the random partition

E„ := \ {*?}). 

Example 5.4. The two cases in which an SLR-mark occurs for geometry (i) are illustrated in
Figure 6. Consider the line in the sample Yule tree which can be identified with the partition
element $\{j, k\}$ where $j \in \ell$ and $k \in r$. Consider case (a) first, shown on the left side of Figure
6: The SL-mark hitting a branch in $\mathcal{Y}_{|\pi'|}$ leads to a jump of the partition element into the
wild-type background. We now have to consider the additional Poisson process at rate $\rho_{LR}$
to determine whether or not the line will split within the wild-type background. If an event
with rate $\rho_{LR}$ occurs, the L-locus is separated from the R-locus on this line. Case (b) is illustrated
on the right side of Figure 6. Here, the line which refers to the partition element $\{j, k\}$ is first
(backward in time) hit by an LR-mark, bringing the R-locus into the wild-type background,
and after that an additional SL-mark hits the same branch, which additionally brings the
L-locus into the wild-type background. In both cases the loci $j$ and $k$ end up separated in
the wild-type background. This is summarized in Definition 5.3 by an SLR-mark.
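The relabelling rules (a) and (b) of Definition 5.3 for geometry (i) can be written as a small classification routine. The Python sketch below is one possible reading, under the following assumptions: mark times on a branch are measured from the root, so the first mark backward in time is the one with the largest time; ties between mark times are ignored; and the extra process in rule (a) is summarized by the probability $1 - e^{-\rho_{LR} t}$ of at least one event in time $t$. The function name `classify_branch` and its arguments are hypothetical.

```python
import math
import random


def classify_branch(sl_times, lr_times, rho_lr, rng):
    """Return "SL", "LR", "SLR" or None for one branch (geometry (i)).

    sl_times, lr_times: times (from the root) of the points of the two
    Poisson processes on this branch. rng: a random.Random instance used
    for the additional process of rule (a).
    """
    if not sl_times and not lr_times:
        return None  # branch carries no mark
    latest_sl = max(sl_times, default=-math.inf)
    latest_lr = max(lr_times, default=-math.inf)
    if latest_sl > latest_lr:
        # Rule (a): the rate rho_SL process puts the first mark backward in
        # time, at time t. A fresh rate rho_LR process is run for time t;
        # the original lr_times play no role in this case.
        t = latest_sl
        extra_event = rng.random() < 1.0 - math.exp(-rho_lr * t)
        return "SLR" if extra_event else "SL"
    # Rule (b): the rate rho_LR process puts the first mark backward in time.
    return "SLR" if sl_times else "LR"


if __name__ == "__main__":
    rng = random.Random(1)
    print(classify_branch([0.2], [0.5], rho_lr=1.0, rng=rng))
    print(classify_branch([], [0.5], rho_lr=1.0, rng=rng))
```

The first call falls under rule (b) with an additional hit of the rate $\rho_{SL}$ process, the second under rule (b) without one.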

As a next step in the proof of Theorem 1 we now show that $\Delta_\pi \sim \Xi_\pi$.

Proposition 5.5. Let $\pi \in \mathcal{P}_{\ell\cup r}$ and $\Delta_\pi$ and $\Xi_\pi$ be as in Definitions 5.1 and 5.3. Then,

$$\sup_{\xi}\big|\mathbf{P}[\Delta_\pi = \xi] - \mathbf{P}[\Xi_\pi = \xi]\big| = O\Big(\frac{1}{(\log\alpha)^2}\Big).$$



Proof. As the mechanism to generate splits in the beneficial background is the same for both
random partitions, $\Delta_\pi$ and $\Xi_\pi$, we concentrate on all other events.

The proof follows along the lines of the Yule approximation in the case of only one neutral
locus, given in [EPW06, Definition 3.3 and Section 4.3]. The crucial observation is that by
a random time change $t \mapsto \tau$ given by $d\tau = (1 - X_t)\,dt$ the frequency path $X$, given by (2.1),
is taken to the solution $Z = (Z_t)_{t\ge 0}$ of

$$dZ = \alpha Z \coth(\alpha Z)\,dt + \sqrt{Z}\,dW \qquad (5.24)$$

with a standard Brownian motion $W$ and $Z_0 = 0$. This is an $\alpha$-supercritical Feller branching
process conditioned on non-extinction. It was shown in [EO94] and [O'C93] that the genealogy
of the $\alpha$-supercritical branching process is a Yule process with branching rate $\alpha$. Observe
that the time-transformation $t \mapsto \tau$ only works until the supercritical branching process has
reached frequency 1. From 4.5(b) in [EPW06] we see that at this time the number of lines
in the Yule process is Poisson distributed with mean $2\alpha$. (The additional factor of 2 arises
because we made the assumption that the individual offspring variance in the underlying
Cannings model is 1 rather than 2. See also [PHW06].) However, as typical deviations in this
Poisson distribution are of the order $\sqrt{\alpha} \ll \alpha$, we may instead assume that the Yule process
has $\lfloor 2\alpha \rfloor$ lines. This was made precise in the proof of Proposition 4.7 in [EPW06].

Moreover, for geometries (i) and (ii) the rates in the process $\eta^X$ change at time $\beta$ from
$\rho_{SL}(1 - X_{T-\beta})$, $\rho_{LR}(1 - X_{T-\beta})$ to $\rho_{SL}$, $\rho_{LR}$ and from $\rho_{LS}(1 - X_{T-\beta})$, $\rho_{SR}(1 - X_{T-\beta})$ to
$\rho_{LS}$, $\rho_{SR}$, respectively. In particular, the time-changed rates are constant. Under the random
time change the coalescence rate (1) changes at time $\beta$ from $1/X_{T-\beta}$ to $1/(X_{T-\beta}(1 - X_{T-\beta}))$.

However, it was shown in [EPW06, Proposition 4.2] that the change of these rates can only
produce an error in probability of order $O((\log\alpha)^{-2})$. This fact was used in [EPW06, Lemma
4.5, Proposition 4.7] to prove that the marked Yule process gives an accurate approximation
in the case of one neutral locus. This result carries over to the present situation
because all Poisson processes along the Yule process have constant rates.

It remains to check that the equivalence relation defining $\Delta_\pi$ coincides with the one defining
$\Xi_\pi$, given that the change in the coalescence rate has no effect. First of all, realize the splits in the beneficial
background according to Definition 5.1. Then, take $j, k \in \ell \cup r$ and trace their partition
elements backwards up to time $t = 0$, $\beta = T$. We only consider geometry (i) and $j \in \ell$, $k \in r$,
since the other cases $j, k \in \ell$ and $j, k \in r$ and all cases for geometry (ii) are similar. If
we consider the process $\eta^X$ from Definition 5.1 without any recombination events we would
obtain a tree for the genealogy relating $j$ and $k$. However, recombination events
may cause the L-locus $j$ and the R-locus $k$ to end up in different partition elements in the
random partition $\Delta_\pi$. This will be the case if and only if one of the following events occurs
in the process $\eta^X$:

(a) a recombination event (3i) with rate $\rho_{SL}(1 - X)$ on a branch of this tree, which takes either $j$ or $k$ to
the wild-type background before coalescence,






(b) a recombination event (4i) with rate $\rho_{LR}(1 - X)$ on the branch carrying $k$, which takes $k$ to the
wild-type background before coalescence with $j$,

(c) an event (4i) with rate $\rho_{LR}(1 - X)$ on the branch common to $j$ and $k$ before (backward in time) an event with
rate $\rho_{SL}(1 - X)$ happens on that branch; in this case $j$ and $k$ have coalesced, but a
recombination event brings $k$ to the wild-type background without $j$,

(d) an event (3i) with rate $\rho_{SL}(1 - X)$ on the branch common to $j$ and $k$ before (backward in time) an event with
rate $\rho_{LR}(1 - X)$ happens on that branch, which brings both $j$ and $k$ to the wild-type
background. Here, an event (6i) at rate $\rho_{LR}(1 - X)$ happens which splits $j$ and $k$ in the
wild-type background.

The trees in events (a)-(d) refer to trees generated by $\eta^X$. By the random time change and
our assumption that the change in coalescence rate does not alter random partitions we can as
well take trees generated by the Yule process and change the rates $\rho_{SL}(1 - X)$ and $\rho_{LR}(1 - X)$
to $\rho_{SL}$ and $\rho_{LR}$. Hence we are dealing with a Yule tree with branching rate $\alpha$ marked by
Poisson processes with rates $\rho_{SL}$ and $\rho_{LR}$, which is exactly the situation of Definition 5.3. Using
the definition of the SL-, LR- and SLR-marks, we note that

• (a) produces either an SL- or an SLR-mark on the corresponding branch,

• (b) produces an LR-mark on the corresponding branch,

• (c) and (d) produce either an LR- or an SLR-mark on the corresponding branch.

If none of these marks occur, $j$ and $k$ are in the same partition element of $\Xi_\pi$ by (3.6). Hence
$\Delta_\pi$ and $\Xi_\pi$ coincide with high probability.

□

We conclude the proof of Theorem 1 by showing that $\Xi_\pi$ from Definition 5.3 and $\Upsilon_\pi$ from
Definition 3.3 are close in variation distance.

Proposition 5.6. Let $\pi \in \mathcal{P}_{\ell\cup r}$ and $\Xi_\pi$ and $\Upsilon_\pi$ be as in Definitions 5.3 and 3.3. Then,

$$\sup_{\xi}\big|\mathbf{P}[\Xi_\pi = \xi] - \mathbf{P}[\Upsilon_\pi = \xi]\big| = O\Big(\frac{1}{(\log\alpha)^2}\Big).$$

Proof. We will only consider geometry (i). The proof for geometry (ii) is analogous.

After realizing the splits in the beneficial background first according to the probabilities
given in (5.1) and (3.3), respectively, $\Xi_\pi$ and $\Upsilon_\pi$ are determined by the same equivalence
relations (3.6) using the marks which hit the tree according to Definition 5.3 and Table 1.
Hence our proof consists of two steps. First, we show that the probabilities given in (5.1) and
(3.3) differ only by $O((\log\alpha)^{-2})$. Second, we show that the error caused by generating the
SL-, LR- and SLR-marks using (3.5) instead of Definition 5.3 is $O((\log\alpha)^{-2})$.

Both assertions rely on the same calculation. Assume a line in the Yule tree starts when
the full Yule tree has $i_1$ lines for the last time and ends when the full Yule tree has $i_2 > i_1$ lines
for the last time. Additionally, the line is hit by a Poisson process with rate $\rho = \gamma\frac{\alpha}{\log\alpha}$. The
probability that the line is not hit by the Poisson process during the time the Yule process
has $i$ lines, $i_1 < i \le i_2$, is

$$\frac{i\alpha}{i\alpha + \rho}$$

because of competing exponential clocks. Analogously, the probability that the whole line is
not hit is, by a Taylor approximation,

$$\prod_{i=i_1+1}^{i_2} \frac{i\alpha}{i\alpha + \rho}
= \exp\Big(\sum_{i=i_1+1}^{i_2} \log\Big(1 - \frac{\rho}{i\alpha + \rho}\Big)\Big)
= \exp\Big(-\frac{\gamma}{\log\alpha}\sum_{i=i_1+1}^{i_2} \frac{1}{i}\Big) + O\Big(\frac{1}{(\log\alpha)^2}\Big)
= p_{i_1}^{i_2}(\gamma) + O\Big(\frac{1}{(\log\alpha)^2}\Big) \qquad (5.25)$$

since the neglected terms in the Taylor series are of order $O(\rho^2/\alpha^2) = O((\log\alpha)^{-2})$ and
higher.
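The size of the error in (5.25) can be inspected numerically. The Python sketch below compares the exact product of the competing-clock probabilities with the exponential of the harmonic sum. The parameter values $\alpha = 1000$, $\gamma = 1$, $i_1 = 0$, $i_2 = 2000$ are arbitrary choices for illustration, and the function names are hypothetical.

```python
import math


def no_hit_exact(alpha, gamma, i1, i2):
    """Product over i = i1+1..i2 of i*alpha / (i*alpha + rho), rho = gamma*alpha/log(alpha)."""
    rho = gamma * alpha / math.log(alpha)
    prod = 1.0
    for i in range(i1 + 1, i2 + 1):
        prod *= i * alpha / (i * alpha + rho)
    return prod


def p_approx(alpha, gamma, i1, i2):
    """exp(-(gamma / log(alpha)) * sum_{i=i1+1}^{i2} 1/i), the leading term in (5.25)."""
    harmonic = sum(1.0 / i for i in range(i1 + 1, i2 + 1))
    return math.exp(-gamma / math.log(alpha) * harmonic)


if __name__ == "__main__":
    alpha, gamma = 1000.0, 1.0
    exact = no_hit_exact(alpha, gamma, 0, 2000)
    approx = p_approx(alpha, gamma, 0, 2000)
    print(f"exact = {exact:.5f}  approx = {approx:.5f}")
    print(f"difference = {exact - approx:.5f}  1/(log alpha)^2 = {1 / math.log(alpha) ** 2:.5f}")
```

For these values the difference is positive and below $1/(\log\alpha)^2$, consistent with the stated order of the neglected Taylor terms.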

To prove that (5.1) and (3.3) coincide approximately, observe that

$$\mathbf{E}\Big[\exp\Big(-\rho\int_0^T X_s\,ds\Big)\Big] = \mathbf{E}\Big[\exp\Big(-\rho\int_0^T (1 - X_s)\,ds\Big)\Big]$$

by the time-reversibility of $X$. Additionally, the right hand side gives the probability that a
Poisson process with rate $\rho(1 - X)$ does not hit a line by time $T$. By the random time change
$d\tau = (1 - X_t)\,dt$ this is approximately the same as the probability that a Poisson process
with rate $\rho$ does not hit one line in a Yule tree until it has $\lfloor 2\alpha \rfloor$ lines and is hence given by
$p_0^{\lfloor 2\alpha \rfloor}(\gamma)$.

Next, we consider the generation of the SL-, LR- and SLR-marks along the Yule
tree. The probability that more than one event with rate $\rho_{SL}$ and $\rho_{LR}$ hits the Yule tree
during the time it has $i$ lines is of order

$$\frac{\rho^2}{(i\alpha + \rho)^2} = O\Big(\frac{1}{(\log\alpha)^2}\Big).$$

Hence we can ignore this event. Together with the Markov property of the Poisson process
we see that the marks on different lines in a sample tree may be generated independently
once the topology and the total number of lines in the full Yule tree is known.

Consider a branch which starts when the full Yule tree has $i_1$ lines and ends when it has
$i_2$ lines. Using Definition 5.3 this line is hit by an SL-mark iff it is hit by the Poisson process
at rate $\rho_{SL}$ and an independent Poisson process with rate $\rho_{LR}$ produces no mark between
time 0 and the time the Yule tree has $i_2$ lines. Hence the probability for an SL-mark in $\Xi_\pi$
is approximately given by

$$\Big(1 - \prod_{i=i_1+1}^{i_2} \frac{i\alpha}{i\alpha + \rho_{SL}}\Big)\Big(\prod_{i=1}^{i_2} \frac{i\alpha}{i\alpha + \rho_{LR}}\Big)
= \big(1 - p_{i_1}^{i_2}(\gamma_{SL})\big)\,p_0^{i_2}(\gamma_{LR}) + O\Big(\frac{1}{(\log\alpha)^2}\Big).$$

If a branch is hit by the Poisson process with rate $\rho_{SL}$ but did not obtain an SL-mark, it
obtains an SLR-mark. Hence the probability for such a mark is given by

$$\Big(1 - \prod_{i=i_1+1}^{i_2} \frac{i\alpha}{i\alpha + \rho_{SL}}\Big)\Big(1 - \prod_{i=1}^{i_2} \frac{i\alpha}{i\alpha + \rho_{LR}}\Big)
= \big(1 - p_{i_1}^{i_2}(\gamma_{SL})\big)\big(1 - p_0^{i_2}(\gamma_{LR})\big) + O\Big(\frac{1}{(\log\alpha)^2}\Big).$$

The branch is hit by an LR-mark if it is hit by the Poisson process at rate $\rho_{LR}$ but not by
the Poisson process with rate $\rho_{SL}$. Hence the probability for an LR-mark is

$$\Big(\prod_{i=i_1+1}^{i_2} \frac{i\alpha}{i\alpha + \rho_{SL}}\Big)\Big(1 - \prod_{i=i_1+1}^{i_2} \frac{i\alpha}{i\alpha + \rho_{LR}}\Big)
= p_{i_1}^{i_2}(\gamma_{SL})\big(1 - p_{i_1}^{i_2}(\gamma_{LR})\big) + O\Big(\frac{1}{(\log\alpha)^2}\Big).$$

As a consequence, the marks in $\Xi_\pi$ and $\Upsilon_\pi$ coincide approximately (cf. Table 1) and
we are done. □
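The three mark probabilities for a single branch can be evaluated directly from the exact products. The Python sketch below does this under the reading that the SL-, SLR- and LR-probabilities are the product expressions of this proof, with the rate $\rho_{LR}$ product in the SL and SLR cases running from the root. The "none" entry, the probability that neither process hits the branch, is added here as the natural complement and is an assumption of this sketch. Parameter values and function names are illustrative only.

```python
import math


def no_hit(alpha, rho, start, end):
    """Probability of no hit while the Yule tree has start+1, ..., end lines."""
    prod = 1.0
    for i in range(start + 1, end + 1):
        prod *= i * alpha / (i * alpha + rho)
    return prod


def mark_probabilities(alpha, gamma_sl, gamma_lr, i1, i2):
    """Mark probabilities for a branch living while the tree has i1+1..i2 lines."""
    scale = alpha / math.log(alpha)       # rho = gamma * alpha / log(alpha)
    rho_sl, rho_lr = gamma_sl * scale, gamma_lr * scale
    a = no_hit(alpha, rho_sl, i1, i2)     # no rate rho_SL hit on the branch
    b = no_hit(alpha, rho_lr, i1, i2)     # no rate rho_LR hit on the branch
    c = no_hit(alpha, rho_lr, 0, i2)      # no rate rho_LR hit from the root on
    return {
        "SL": (1 - a) * c,
        "SLR": (1 - a) * (1 - c),
        "LR": a * (1 - b),
        "none": a * b,                    # assumed complement, not stated in the text
    }


if __name__ == "__main__":
    probs = mark_probabilities(alpha=1000.0, gamma_sl=0.5, gamma_lr=1.0, i1=3, i2=2000)
    for name, value in probs.items():
        print(f"{name:>4}: {value:.4f}")
    print("total:", sum(probs.values()))
```

With the complement included the four values add up to one, which is a quick consistency check on the three displayed expressions.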